Abstract


‘Bayesian Methods for Statistical Analysis’ is a book on statistical 
methods for analysing a wide variety of data. The book consists of 12 
chapters, starting with basic concepts and covering numerous topics, 
including Bayesian estimation, decision theory, prediction, hypothesis 
testing, hierarchical models, Markov chain Monte Carlo methods, finite 
population inference, biased sampling and nonignorable nonresponse. 
The book contains many exercises, all with worked solutions, including 
complete computer code. It is suitable for self-study or a semester-long 
course, with three hours of lectures and one tutorial per week for 13 weeks. 


Preface


‘Bayesian Methods for Statistical Analysis’ is a book which can be used 
as the text for a semester-long course and is suitable for anyone who is 
familiar with statistics at the level of ‘Mathematical Statistics with 
Applications’ by Wackerly, Mendenhall and Scheaffer (2008). The book 
does not attempt to cover all aspects of Bayesian methods but to provide 
a ‘guided tour’ through the subject matter, one which naturally reflects the 
author's particular interests gained over years of research and teaching. 


For a more comprehensive account of Bayesian methods, the reader is 
referred to the very extensive literature on this subject, including ‘Theory 
of Probability’ by Jeffreys (1961), ‘Bayesian Inference in Statistical 
Analysis’ by Box and Tiao (1973), ‘Markov Chain Monte Carlo in 
Practice’ by Gilks et al. (1996), ‘Bayesian Statistics: An Introduction’ by 
Lee (1997), ‘Bayesian Methods: An Analysis for Statisticians and 
Interdisciplinary Researchers’ by Leonard and Hsu (1999), ‘Bayesian 
Data Analysis’ by Gelman et al. (2004), ‘Computational Bayesian 
Statistics’ by Bolstad (2009) and ‘Handbook of Markov Chain Monte 
Carlo’ by Brooks et al. (2011). See also Smith and Gelfand (1992) and 
O'Hagan and Forster (2004). 


The software packages which feature in this book are R and WinBUGS. 


R is a general software environment for statistical computing and graphics 
which compiles and runs on UNIX platforms, Windows and MacOS. This 
software is available for free at www.r-project.org/. Two useful guides to
R are ‘Bayesian Computation With R’ by Albert (2009) and ‘Data
Analysis and Graphics Using R: An Example-Based Approach’ by
Maindonald and Braun (2010). 


BUGS stands for ‘Bayesian Inference Using Gibbs Sampling’ and is a 
specialised software environment for the Bayesian analysis of complex 
statistical models using Markov chain Monte Carlo methods. WinBUGS, 
a version of BUGS for Microsoft Windows, is available for free at 
www.mrc-bsu.cam.ac.uk/software/bugs/. Two useful guides to WinBUGS
are ‘Bayesian Modeling Using WinBUGS’ by Ntzoufras (2009) and
‘Bayesian Population Analysis Using WinBUGS’ by Kéry and Schaub 
(2012). 


The present book includes a large number of exercises, interspersed
throughout and each followed by a detailed solution, including complete 
computer code. A student should be able to reproduce all of the numerical 
and graphical results in the book by running the provided code. Although 
many of the exercises are straightforward, some are fairly involved, and a 
few will be of interest only to the particularly keen or advanced student. 
All of the code in this book is also available in the form of an electronic 
text document which can be obtained from the same website as the book. 


This book is in the form of an Adobe PDF file saved from Microsoft Word 
2013 documents, with the equations as MathType 6.9 objects. The figures 
in the book were created using Microsoft Paint, the Snipping Tool in 
Windows, WinBUGS and R. In the few instances where color is used, this 
is only for additional clarity. Thus, the book can be printed in black and 
white with no loss of essential information. 


The following chapter provides an overview of the book. Appendix A 
contains several additional exercises with worked solutions, Appendix B 
has selected distributions and notation, and Appendix C lists some 
abbreviations and acronyms. Following the appendices is a bibliography 
for the entire book. 


The last four of the 12 chapters in this book constitute a practical 
companion to ‘Monte Carlo Methods for Finite Population Inference’, a 
largely theoretical manuscript written by the author (Puza, 1995) during 
the last year of his employment at the Australian Bureau of Statistics in 
Canberra. 


Overview


Chapter 1: Bayesian Basics Part 1 (pages 1-60)


Introduces Bayes' rule, Bayes factors, Bayesian models, posterior 
distributions, and the proportionality formula. Also covered are the 
binomial-beta model, Jeffreys’ famous tramcar problem, the
distinction between finite population inference and superpopulation 
inference, conjugacy, point and interval estimation, inference on functions 
of parameters, credibility estimation, the normal-normal model, and the 
normal-gamma model. 


Chapter 2: Bayesian Basics Part 2 (pages 61-108)


Covers the frequentist characteristics of Bayesian estimators including 
bias and coverage probabilities, mixture priors, uninformative priors 
including the Jeffreys prior, and Bayesian decision theory including the 
posterior expected loss and Bayes risk. 


Chapter 3: Bayesian Basics Part 3 (pages 109-152) 


Covers inference based on functions of the data including censoring and 
rounded data, predictive inference, posterior predictive p-values, 
multiple-parameter models, and the normal-normal-gamma model 
including an example of Bayesian finite population inference. 


Chapter 4: Computational Tools (pages 153-200) 


Covers the Newton-Raphson (NR) algorithm including its multivariate 
version, the expectation-maximisation (EM) algorithm, hybrid search 
algorithms, integration techniques including double integration, 
optimisation in R, and specification of prior distributions. 


Chapter 5: Monte Carlo Basics (pages 201-262)


Covers Monte Carlo integration, importance sampling, the method of 
composition, Buffon's needle problem, testing the coverage of Monte 
Carlo confidence intervals, random number generation including the 
inversion technique, rejection sampling, and applications to Bayesian 
inference including prediction in the normal-normal-gamma model, Rao- 
Blackwell estimation, and estimation of posterior predictive p-values. 


Chapter 6: MCMC Methods Part 1 (pages 263-320)


Covers Markov chain Monte Carlo (MCMC) methods including the 
Metropolis-Hastings algorithm, the Gibbs sampler, specification of tuning 
parameters, the batch means method, computational issues, and 
applications to the normal-normal-gamma model. 


Chapter 7: MCMC Methods Part 2 (pages 321-364) 


Covers stochastic data augmentation, a comparison of classical and 
Bayesian methods for linear regression and logistic regression, 
respectively, and a Bayesian model for correlated Bernoulli data. 


Chapter 8: MCMC Inference via WinBUGS 
(pages 365-406) 


Provides a detailed tutorial in the WinBUGS computer package including 
running WinBUGS within R, and shows how WinBUGS can be used for 
linear regression, logistic regression and ARIMA time series analysis. 


Chapter 9: Bayesian Finite Population Theory 
(pages 407-466) 


Introduces notation and terminology for Bayesian finite population 
inference in the survey sampling context, and discusses ignorable and 
nonignorable sampling mechanisms. These concepts are illustrated by 
way of examples and exercises, some of which involve MCMC methods. 


Chapter 10: Normal Finite Population Models 
(pages 467-514)


Contains a generalisation of the normal-normal-gamma model to the finite 
population context with covariates. Useful vector and matrix formulae are 
provided, special cases such as ratio estimation are treated in detail, and it 
is shown how MCMC methods can be used for both descriptive and 
analytic inferences. 


Chapter 11: Transformations and Other Topics
(pages 515-558) 


Shows how MCMC methods can be used for inference on complicated 
functions of superpopulation and finite population quantities, as well as for
inference based on transformed data. Frequentist characteristics of 
Bayesian estimators are discussed in the finite population context, with 
examples of how Monte Carlo methods can be used to estimate model 
bias, design bias, model coverage and design coverage. 


Chapter 12: Biased Sampling and Nonresponse 
(pages 559-608) 


Discusses and provides examples of ignorable and nonignorable response 
mechanisms, with an exercise involving follow-up data. The topic of self- 
selection bias in volunteer surveys is studied from a frequentist 
perspective, then treated using Bayesian methods, and finally extended to 
the finite population context. 


Appendix A: Additional Exercises (pages 609-666)


Provides practice at applying concepts in the last four chapters. 


Appendix B: Distributions and Notation 
(pages 667-672) 


Provides details of some distributions which feature in the book. 


Appendix C: Abbreviations and Acronyms 
(pages 673-676) 


Catalogues many of the simplified expressions used throughout. 


Computer Code in Bayesian Methods for Statistical 
Analysis 


Combines all of the R and WinBUGS code interspersed throughout the 
679-page book. This separate 126-page PDF file is available online at: 
http://eview.anu.edu.au/bayesian_methods/pdf/computer_code.pdf.


1.1 Introduction


Bayesian methods is a term which may be used to refer to any 
mathematical tools that are useful and relevant in some way to Bayesian 
inference, an approach to statistics based on the work of Thomas Bayes 
(1701-1761). Bayes was an English mathematician and Presbyterian 
minister who is best known for having formulated a basic version of the 
well-known Bayes’ Theorem. 


Figure 1.1 (page 3) shows part of the Wikipedia article for Thomas 
Bayes. Bayes’ ideas were later developed and generalised by many 
others, most notably the French mathematician Pierre-Simon Laplace 
(1749-1827) and the British astronomer Harold Jeffreys (1891-1989). 


Bayesian inference is different to classical inference (or frequentist 
inference) mainly in that it treats model parameters as random variables 
rather than as constants. The Bayesian framework (or paradigm) allows 
for prior information to be formally taken into account. It can also be 
useful for formulating a complicated statistical model that presents a 
challenge to classical methods. 


One drawback of Bayesian inference is that it invariably requires a prior 
distribution to be specified, even in the absence of any prior information. 
However, suitable uninformative prior distributions (also known as 
noninformative, objective or reference priors) have been developed 
which address this issue, and in many cases a nice feature of Bayesian 
inference is that these priors lead to exactly the same point and interval 
estimates as does classical inference. The issue becomes even less 
important when there is at least a moderate amount of data available. As 
sample size increases, the Bayesian approach typically converges to the 
same inferential results, irrespective of the specified prior distribution. 


Another issue with Bayesian inference is that, although it may easily 
lead to suitable formulations of a challenging statistical problem, the 
types of calculation needed for inference can themselves be very 
complicated. Often, these calculations take on the form of multiple 
integrals (or summations) which are intractable and difficult (or
impossible) to solve, even with the aid of advanced numerical 
techniques. 


In such situations, the desired solutions can typically be approximated to 
any degree of precision using Monte Carlo (MC) methods. The idea is to 
make clever use of a large sample of values generated from a suitable 
probability distribution. 


How to generate this sample presents another problem, but one which 
can typically be solved easily via Markov chain Monte Carlo (MCMC) 
methods. Both MC and MCMC methods will feature in later chapters of 
the course. 


1.2 Bayes’ rule 


The starting point for Bayesian inference is Bayes’ rule. The simplest 
form of this is 


$$P(A|B) = \frac{P(A)P(B|A)}{P(A)P(B|A) + P(\bar{A})P(B|\bar{A})},$$

where A and B are events such that P(B) > 0. This is easily proven by
considering that:


$P(A|B) = P(AB)/P(B)$ by the definition of conditional probability
$P(AB) = P(A)P(B|A)$ by the multiplicative law of probability
$P(B) = P(AB) + P(\bar{A}B) = P(A)P(B|A) + P(\bar{A})P(B|\bar{A})$
by the law of total probability.


We see that the posterior probability P(A|B) is equal to the prior 
probability P(A) multiplied by a factor, where this factor is given by 
P(B| A)/ P(B). 


As regards terminology, we call P(A) the prior probability of A 
(meaning the probability of A before B is known to have occurred), and 
we call P(A|B) the posterior probability of A given B (meaning the 
probability of A after B is known to have occurred). We may also say 
that P(A) represents our a priori beliefs regarding A, and P(A|B) 
represents our a posteriori beliefs regarding A. 


Figure 1.1 Beginning of the Wikipedia article on Thomas
Bayes
Source: en.wikipedia.org/wiki/Thomas_Bayes, 29/10/2014


[Figure: screenshot of the opening of the Wikipedia article on Thomas Bayes, showing the lead paragraph, the portrait and infobox, the contents list and the start of the Biography section.]




More generally, we may consider any event B such that P(B) > 0 and
k > 1 events $A_1, \ldots, A_k$ which form a partition of any superset of B (such
as the entire sample space S). Then, for any i = 1,...,k, it is true that


$$P(A_i|B) = \frac{P(A_i B)}{P(B)},$$


where $P(B) = \sum_{j=1}^{k} P(A_j B)$ and $P(A_j B) = P(A_j)P(B|A_j)$.


Exercise 1.1 Medical testing


The incidence of a disease in the population is 1%. A medical test for the
disease is 90% accurate in the sense that it produces a false reading 10%
of the time, both: (a) when the test is applied to a person with the
disease; and (b) when the test is applied to a person without the disease.


A person is randomly selected from the population and given the test. The
test result is positive (i.e. it indicates that the person has the disease). 


What is the probability that the person actually has the disease? 


Solution to Exercise 1.1 


Let A be the event that the person has the disease, and let B be the event 
that they test positive for the disease. Then: 
$P(A) = 0.01$ (the prior probability of the person having the disease)
$P(B|A) = 0.9$ (the true positive rate, also called
the sensitivity of the test)
$P(\bar{B}|\bar{A}) = 0.9$ (the true negative rate, also called
the specificity of the test).


So: $P(AB) = P(A)P(B|A) = 0.01 \times 0.9 = 0.009$
$P(\bar{A}B) = P(\bar{A})P(B|\bar{A}) = 0.99 \times 0.1 = 0.099$.


So the unconditional (or prior) probability of the person testing positive
is $P(B) = P(AB) + P(\bar{A}B) = 0.009 + 0.099 = 0.108$.


So the required posterior probability of the person having the disease is
$$P(A|B) = \frac{P(AB)}{P(B)} = \frac{0.009}{0.108} = \frac{1}{12} = 0.08333.$$


Figure 1.2 is a Venn diagram which illustrates how B may be considered
as the union of $AB$ and $\bar{A}B$. The required posterior probability of A
given B is simply the probability of AB divided by the probability of B.


Figure 1.2 Venn diagram for Exercise 1.1


[Figure: Venn diagram showing the event B as the union of $AB$ and $\bar{A}B$.]


Discussion 


It may seem that the posterior probability that the person has the disease
(1/12) is rather low, considering the high accuracy of the test (namely
$P(B|A) = P(\bar{B}|\bar{A}) = 0.9$).


This may be explained by considering 1,000 random persons in the 
population and applying the test to each one. About 10 persons will have 
the disease, and of these, 9 will test positive. Of the 990 who do not have 
the disease, 99 will test positive. So the total number of persons testing 
positive will be 9 + 99 = 108, and the proportion of these 108 who 
actually have the disease will be 9/108 = 1/12. This heuristic derivation 
of the answer shows it to be small on account of the large number of 
false positives (99) amongst the overall number of positives (108). 
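

The heuristic count above can also be checked by simulation. The following is only an illustrative sketch and is not part of the book's code: the number of simulated persons, the seed and all variable names are arbitrary choices made here.

```r
# Monte Carlo check of Exercise 1.1 (illustrative sketch; n and the seed are arbitrary)
set.seed(1)
n <- 1e6                      # number of simulated persons
p <- 0.01                     # prevalence, P(A)
q <- 0.9                      # common accuracy rate of the test
disease  <- runif(n) < p      # TRUE if the person has the disease
correct  <- runif(n) < q      # TRUE if the test gives a correct reading
# A person with the disease tests positive when the reading is correct;
# a person without the disease tests positive when the reading is wrong.
positive <- ifelse(disease, correct, !correct)
est <- mean(disease[positive])   # proportion of positives who have the disease
est                              # should be close to 1/12
```

With a sample this large the estimate should agree with 1/12 to about two decimal places, with the remaining difference being Monte Carlo error.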


On the other hand, it may be noted that the posterior probability of the
person having the disease is actually very high relative to the prior
probability of them having the disease (P(A) = 0.01). The positive test
result has greatly increased the person's chance of having the disease
(increased it by more than 700%, since 0.01 + 7.333 × 0.01 = 0.08333).


It is instructive to generalise the answer (1/12) as a function of the
prevalence (i.e. proportion) of the disease in the population, p = P(A),
and the common accuracy rate of the test, $q = P(B|A) = P(\bar{B}|\bar{A})$.


We find that
$$P(A|B) = \frac{P(A)P(B|A)}{P(A)P(B|A) + P(\bar{A})P(B|\bar{A})} = \frac{pq}{pq + (1-p)(1-q)}.$$


Figure 1.3 shows the posterior probability of the person having the
disease (P(A|B)) as a function of p with q fixed at 0.9 and 0.95,
respectively (subplot (a)), and as a function of q with p fixed at 0.01 and
0.05, respectively (subplot (b)). In each case, the answer (1/12) is
represented as a dot corresponding to p = 0.01 and q = 0.9.


Figure 1.3 Posterior probability of disease as functions of p
and q


R Code for Exercise 1.1


# Posterior probability P(A|B) as a function of prevalence p and accuracy q
PAgBfun=function(p=0.01,q=0.9){ p*q / (p*q+(1-p)*(1-q)) }
PAgBfun() # 0.08333333


# Values for subplot (a): P(A|B) against p, with q = 0.9 and q = 0.95
pvec=seq(0,1,0.01); Pveca=PAgBfun(p=pvec,q=0.9)
Pveca2=PAgBfun(p=pvec,q=0.95)

# Values for subplot (b): P(A|B) against q, with p = 0.01 and p = 0.05
qvec=seq(0,1,0.01); Pvecb=PAgBfun(p=0.01,q=qvec)
Pvecb2=PAgBfun(p=0.05,q=qvec)


X11(w=8,h=7); par(mfrow=c(2,1));


plot(pvec,Pveca,type="l",xlab="p=P(A)",ylab="P(A|B)",lwd=2)
points(0.01,1/12,pch=16,cex=1.5); text(0.05,0.8,"(a)",cex=1.5)
lines(pvec,Pveca2,lty=2,lwd=2)
legend(0.7,0.5,c("q = 0.9","q = 0.95"),lty=c(1,2),lwd=c(2,2))


plot(qvec,Pvecb,type="l",xlab="q=P(B|A)=P(B'|A')",ylab="P(A|B)",lwd=2)
points(0.9,1/12,pch=16,cex=1.5); text(0.05,0.8,"(b)",cex=1.5)
lines(qvec,Pvecb2,lty=2,lwd=2)
legend(0.2,0.8,c("p = 0.01","p = 0.05"),lty=c(1,2),lwd=c(2,2))


# Technical note: The graph here was copied from R as ‘bitmap’ and then
# pasted into a Word document, which was then saved as a PDF. If the graph
# is copied from R as ‘metafile’, it appears correct in the Word document,
# but becomes corrupted in the PDF, with axis legends slightly off-centre.
# So, all graphs in this book created in R were copied into Word as ‘bitmap’.

Exercise 1.2 Blood types


In a particular population: 
10% of persons have Type 1 blood, 
and of these, 2% have a particular disease; 
30% of persons have Type 2 blood, 
and of these, 4% have the disease; 
60% of persons have Type 3 blood, 
and of these, 3% have the disease. 


A person is randomly selected from the population and found to have the 
disease. 


What is the probability that this person has Type 3 blood? 


Solution to Exercise 1.2


Let A = ‘The person has Type 1 blood’ 
B = ‘The person has Type 2 blood’ 
C = ‘The person has Type 3 blood’ 
D = ‘The person has the disease’. 


Then: P(A) = 0.1, P(D|A) = 0.02
P(B) = 0.3, P(D|B) = 0.04
P(C) = 0.6, P(D|C) = 0.03.


So: $P(D) = P(AD) + P(BD) + P(CD)$
$= P(A)P(D|A) + P(B)P(D|B) + P(C)P(D|C)$
$= 0.1 \times 0.02 + 0.3 \times 0.04 + 0.6 \times 0.03$
$= 0.002 + 0.012 + 0.018 = 0.032$.


Hence: $$P(C|D) = \frac{P(CD)}{P(D)} = \frac{0.018}{0.032} = \frac{9}{16} = 56.25\%.$$


Figure 1.4 is a Venn diagram showing how D may be considered as the 
union of AD, BD and CD. The required posterior probability of C given 
D is simply the probability of CD divided by the probability of D. 


Figure 1.4 Venn diagram for Exercise 1.2
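

The calculation in Exercise 1.2 is an instance of Bayes' rule over a partition, and it can be sketched in a few lines of R. The code below is an illustration written for this purpose rather than the book's own listing; the vector names are arbitrary.

```r
# Bayes' rule over a partition, applied to Exercise 1.2 (illustrative sketch)
prior <- c(Type1 = 0.1,  Type2 = 0.3,  Type3 = 0.6)    # P(A), P(B), P(C)
lik   <- c(Type1 = 0.02, Type2 = 0.04, Type3 = 0.03)   # P(D|A), P(D|B), P(D|C)
joint <- prior * lik       # P(AD), P(BD), P(CD)
PD    <- sum(joint)        # P(D), by the law of total probability
post  <- joint / PD        # P(A|D), P(B|D), P(C|D)
PD
post
```

The third element of `post` is the required probability 9/16, and the three posterior probabilities sum to 1 because the blood types partition the population.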


1.3 Bayes factors


One way to perform hypothesis testing in the Bayesian framework is via 
the theory of Bayes factors. Suppose that on the basis of an observed 
event D (standing for data) we wish to test a null hypothesis 

$H_0: E_0$
versus an alternative hypothesis
$H_1: E_1$,
where $E_0$ and $E_1$ are two events (which are not necessarily mutually
exclusive or even exhaustive of the event space).


Then we calculate:
$\pi_0 = P(E_0)$ = the prior probability of the null hypothesis
$\pi_1 = P(E_1)$ = the prior probability of the alternative hypothesis
$PRO = \pi_0/\pi_1$ = the prior odds in favour of the null hypothesis
$p_0 = P(E_0|D)$ = the posterior probability of the null hypothesis
$p_1 = P(E_1|D)$ = the posterior probability of the alternative hypothesis
$POO = p_0/p_1$ = the posterior odds in favour of the null hypothesis.


The Bayes factor is then defined as BF = POO/PRO. This may be
interpreted as the factor by which the data have multiplied the odds in
favour of the null hypothesis relative to the alternative hypothesis. If
BF > 1 then the data have increased the relative likelihood of the null, and
if BF < 1 then the data have decreased that relative likelihood. The
magnitude of BF tells us how much effect the data have had on the
relative likelihood.


Note 1: Another way to express the Bayes factor is as
$$BF = \frac{p_0/p_1}{\pi_0/\pi_1} = \frac{P(E_0|D)/P(E_1|D)}{P(E_0)/P(E_1)} = \frac{P(D)P(E_0|D)/P(E_0)}{P(D)P(E_1|D)/P(E_1)} = \frac{P(D|E_0)}{P(D|E_1)}.$$


Thus, the Bayes factor may also be interpreted as the ratio of the 
likelihood of the data given the null hypothesis to the likelihood of the 
data given the alternative hypothesis. 


Note 2: The idea of a Bayes factor extends to situations where the null
and alternative hypotheses are statistical models rather than events. This 
idea may be taken up later. 


Exercise 1.3 Bayes factor in disease testing 


The incidence of a disease in the population is 1%. A medical test for the
disease is 90% accurate in the sense that it produces a false reading 10%
of the time, both: (a) when the test is applied to a person with the
disease; and (b) when the test is applied to a person without the disease.


A person is randomly selected from the population and given the test. The
test result is positive (i.e. it indicates that the person has the disease). 


Calculate the Bayes factor for testing that the person has the disease 
versus that they do not have the disease. 


Solution to Exercise 1.3 


Recall from Exercise 1.1 that, with A = ‘Person has disease’ and B = ‘Person
tests positive’, the relevant probabilities are $P(A) = 0.01$, $P(B|A) = 0.9$
and $P(\bar{B}|\bar{A}) = 0.9$, from which it can be deduced that $P(A|B) = 1/12$.


We now wish to test $H_0: A$ vs $H_1: \bar{A}$. So we calculate:
$\pi_0 = P(A) = 0.01$, $\pi_1 = P(\bar{A}) = 0.99$, $PRO = \pi_0/\pi_1 = 1/99$,
$p_0 = P(A|B) = 1/12$, $p_1 = P(\bar{A}|B) = 11/12$, $POO = p_0/p_1 = 1/11$.


Hence the required Bayes factor is BF = POO/PRO = (1/11)/(1/99) = 9. 


This means the positive test result has multiplied the odds of the person 
having the disease relative to not having it by a factor of 9 or 900%. 
Another way to say this is that those odds have increased by 800%. 


Note: We could also work out the Bayes factor here as
$$BF = \frac{P(B|A)}{P(B|\bar{A})} = \frac{0.9}{0.1} = 9,$$
namely as the ratio of the probability that the person tests positive given
they have the disease to the probability that they test positive given they
do not have the disease.
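

Both routes to the Bayes factor in Exercise 1.3 can be checked numerically. The following short R sketch is an illustration only (the variable names are chosen here and do not come from the book); it computes the Bayes factor first as posterior odds over prior odds and then as a ratio of likelihoods.

```r
# Bayes factor for Exercise 1.3, computed two ways (illustrative sketch)
p <- 0.01                        # P(A), the prior probability of disease
q <- 0.9                         # P(B|A), and also the specificity of the test
PB   <- p*q + (1-p)*(1-q)        # P(B), by the law of total probability
PAgB <- p*q / PB                 # P(A|B), the posterior probability of disease
PRO  <- p / (1-p)                # prior odds in favour of disease
POO  <- PAgB / (1-PAgB)          # posterior odds in favour of disease
BF1  <- POO / PRO                # Bayes factor as posterior odds over prior odds
BF2  <- q / (1-q)                # Bayes factor as P(B|A) / P(B|not A)
c(BF1, BF2)
```

The two values agree, as Note 1 above implies. The second form does not involve the prior probability p at all.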


1.4 Bayesian models


Bayes’ formula extends naturally to statistical models. A Bayesian 
model is a parametric model in the classical (or frequentist) sense, but 
with the addition of a prior probability distribution for the model 
parameter, which is treated as a random variable rather than an unknown 
constant. The basic components of a Bayesian model may be listed as: 
* the data, denoted by y
* the parameter, denoted by $\theta$
* the model distribution, given by a specification of
$f(y|\theta)$ or $F(y|\theta)$ or the distribution of $(y|\theta)$
* the prior distribution, given by a specification of
$f(\theta)$ or $F(\theta)$ or the distribution of $\theta$.


Here, F is a generic symbol which denotes cumulative distribution 
function (cdf), and f is a generic symbol which denotes probability 
density function (pdf) (when applied to a continuous random variable) or 
probability mass function (pmf) (when applied to a discrete random 
variable). For simplicity, we will avoid the term pmf and use the term 
pdf or density for all types of random variable, including the mixed type. 


Note 1: A mixed distribution is defined by a cdf which exhibits at least 
one discontinuity (or jump) and is strictly increasing over at least one 
interval of values. 


Note 2: The prior may be specified by writing a statement of the form
‘$\theta \sim \ldots$’, where the symbol ‘~’ means ‘is distributed as’, and where
‘...’ denotes the relevant distribution. Likewise, the model for the data
may be specified by writing a statement of the form ‘$(y|\theta) \sim \ldots$’.


Note 3: At this stage we will not usually distinguish between y as a
random variable and y as a value of that random variable; but sometimes
we may use Y for the former. Each of y and $\theta$ may be a scalar, vector,
matrix or array. Also, each component of y and $\theta$ may have a discrete
distribution, a continuous distribution, or a mixed distribution.


In the first few examples below, we will focus on the simplest case
where both y and $\theta$ are scalar and discrete.


1.5 The posterior distribution


Bayesian inference requires determination of the posterior probability
distribution of $\theta$. This task is equivalent to finding the posterior pdf of
$\theta$, which may be done using the equation


$$f(\theta|y) = \frac{f(\theta) f(y|\theta)}{f(y)}.$$


Here, f(y) is the unconditional (or prior) pdf of y, as given by
$$f(y) = \int f(y|\theta)\,dF(\theta) = \begin{cases} \int f(\theta) f(y|\theta)\,d\theta & \text{if } \theta \text{ is continuous} \\ \sum_{\theta} f(\theta) f(y|\theta) & \text{if } \theta \text{ is discrete.} \end{cases}$$


Note: Here, $\int f(y|\theta)\,dF(\theta)$ is a Lebesgue-Stieltjes integral, which may
need evaluating by breaking the integral into two parts in the case where
$\theta$ has a mixed distribution. In the continuous case, think of $dF(\theta)$ as
$$\frac{dF(\theta)}{d\theta}\,d\theta = f(\theta)\,d\theta.$$
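

To make the remark about mixed distributions concrete, the sketch below evaluates f(y) by splitting the integral into a discrete part and a continuous part. The prior used here is purely hypothetical and does not appear in the book: $\theta$ equals 0.5 with probability 0.5, and is otherwise uniform on (0,1). The model $(y|\theta) \sim$ Bin(2, $\theta$) with y = 2 is likewise assumed only for illustration.

```r
# f(y) under a hypothetical mixed prior (illustrative sketch, not from the book)
y <- 2
lik <- function(theta) dbinom(y, size = 2, prob = theta)   # f(y|theta)

w_point <- 0.5; theta0 <- 0.5    # assumed point mass: P(theta = 0.5) = 0.5
w_cont  <- 0.5                   # assumed weight on the continuous part
dens_cont <- function(theta) w_cont * dunif(theta, 0, 1)   # continuous part of the prior

discrete_part   <- w_point * lik(theta0)                   # sum over the jump of F
continuous_part <- integrate(function(t) lik(t) * dens_cont(t), 0, 1)$value
fy <- discrete_part + continuous_part                      # f(y)
fy
```

Under these assumptions the two parts are 0.5 × 0.25 = 0.125 and 0.5 × 1/3, so f(y) is about 0.2917.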


Exercise 1.4 Loaded dice 


Consider six loaded dice with the following properties. Die A has 
probability 0.1 of coming up 6, each of Dice B and C has probability 0.2 
of coming up 6, and each of Dice D, E and F has probability 0.3 of 
coming up 6. 


A die is chosen randomly from the six dice and rolled twice. On both 
occasions, 6 comes up. 


What is the posterior probability distribution of $\theta$, the probability of 6
coming up on the chosen die?


Solution to Exercise 1.4 


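Let y be the number of times that 6 comes up on the two rolls of the
chosen die, and let $\theta$ be the probability of 6 coming up on a single roll
of that die. Then the Bayesian model is:
$$(y|\theta) \sim \text{Bin}(2, \theta)$$
$$f(\theta) = \begin{cases} 1/6, & \theta = 0.1 \\ 2/6, & \theta = 0.2 \\ 3/6, & \theta = 0.3. \end{cases}$$


In this case y = 2 and so
$$f(y|\theta) = \binom{2}{y}\theta^y(1-\theta)^{2-y} = \binom{2}{2}\theta^2(1-\theta)^{2-2} = \theta^2.$$


So $$f(y) = \sum_{\theta} f(\theta) f(y|\theta) = \frac{1}{6}(0.1)^2 + \frac{2}{6}(0.2)^2 + \frac{3}{6}(0.3)^2 = 0.06.$$


So $$f(\theta|y) = \frac{f(\theta) f(y|\theta)}{f(y)} = \begin{cases} (1/6)0.1^2/0.06 = 0.02778, & \theta = 0.1 \\ (2/6)0.2^2/0.06 = 0.22222, & \theta = 0.2 \\ (3/6)0.3^2/0.06 = 0.75, & \theta = 0.3. \end{cases}$$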


Note: This result means that if the chosen die were to be tossed again a 
large number of times (say 10,000) then there is a 75% chance that 6 
would come up about 30% of the time, a 22.2% chance that 6 would 
come up about 20% of the time, and a 2.8% chance that 6 would come 
up about 10% of the time. 
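

The posterior in Exercise 1.4 can be reproduced directly from the equation $f(\theta|y) = f(\theta) f(y|\theta)/f(y)$. The R sketch below is an illustration rather than the book's own code, with variable names chosen here.

```r
# Posterior for the loaded dice of Exercise 1.4 (illustrative sketch)
theta <- c(0.1, 0.2, 0.3)                    # possible values of theta
prior <- c(1, 2, 3) / 6                      # f(theta): one, two and three dice respectively
y     <- 2                                   # number of sixes in two rolls
lik   <- dbinom(y, size = 2, prob = theta)   # f(y|theta), equal to theta^2 here
fy    <- sum(prior * lik)                    # f(y), the unconditional probability of y
post  <- prior * lik / fy                    # f(theta|y)
fy
round(post, 5)
```

Changing `y` to 0 or 1 gives the posterior for the other possible outcomes of the two rolls under the same model.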


1.6 The proportionality formula 


Observe that f(y) is a constant with respect to $\theta$ in the Bayesian
equation
$$f(\theta|y) = f(\theta) f(y|\theta) / f(y),$$
which means that we may also write the equation as
$$f(\theta|y) = \frac{f(\theta) f(y|\theta)}{k}$$
or as
$$f(\theta|y) = c f(\theta) f(y|\theta),$$
where k = f(y) and c = 1/k.


We may also write
$$f(\theta|y) \propto f(\theta) f(y|\theta),$$
where $\propto$ is the proportionality sign.


Equivalently, we may write


$$f(\theta|y) \overset{\theta}{\propto} f(\theta) f(y|\theta)$$


to emphasise that the proportionality is specifically with respect to $\theta$.


Another way to express the last equation is
$$f(\theta|y) \propto f(\theta) \times L(\theta|y),$$
where $L(\theta|y)$ is the likelihood function (defined as the model
density $f(y|\theta)$ multiplied by any constant with respect to $\theta$, and
viewed as a function of $\theta$ rather than of y).


The last equation may also be stated in words as: 
The posterior is proportional to the prior times the likelihood. 


These observations indicate a shortcut method for determining the 
required posterior distribution which obviates the need for calculating 
f(y) (which may be difficult). 


This method is to multiply the prior density (or the kernel of that
density) by the likelihood function and try to identify the resulting
function of $\theta$ as the density of a well-known or common distribution.


Once the posterior distribution has been identified, f(y) may then be 
obtained easily as the associated normalising constant. 


Exercise 1.5 Loaded dice with solution via the proportionality 
formula 


As in Exercise 1.4, suppose that Die A has probability 0.1 of coming up 
6, each of Dice B and C has probability 0.2 of coming up 6, and each of 
Dice D, E and F has probability 0.3 of coming up 6. 


A die is chosen randomly from the six dice and rolled twice. On both 
occasions, 6 comes up. 


Using the proportionality formula, find the posterior probability
distribution of $\theta$, the probability of 6 coming up on the chosen die.
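

Solution to Exercise 1.5


With y denoting the number of times 6 comes up, the Bayesian model
may be written:


$$f(y|\theta) = \binom{2}{y}\theta^y(1-\theta)^{2-y}, \quad y = 0, 1, 2$$
$$f(\theta) = 10\theta/6, \quad \theta = 0.1, 0.2, 0.3.$$


Note: $10\theta/6$ = 1/6, 2/6 and 3/6 for $\theta$ = 0.1, 0.2 and 0.3, respectively.


Hence $$f(\theta|y) \propto f(\theta) f(y|\theta) = \frac{10\theta}{6}\binom{2}{y}\theta^y(1-\theta)^{2-y} \propto \theta \times \theta^2 \quad \text{since } y = 2.$$


Thus $$f(\theta|y) \propto \theta^3 = \begin{cases} 0.1^3 = 1/1000, & \theta = 0.1 \\ 0.2^3 = 8/1000, & \theta = 0.2 \\ 0.3^3 = 27/1000, & \theta = 0.3 \end{cases} \;\propto\; \begin{cases} 1, & \theta = 0.1 \\ 8, & \theta = 0.2 \\ 27, & \theta = 0.3. \end{cases}$$


Now, 1 + 8 + 27 = 36, and so
$$f(\theta|y) = \begin{cases} 1^3/36 = 0.02778, & \theta = 0.1 \\ 2^3/36 = 0.22222, & \theta = 0.2 \\ 3^3/36 = 0.75, & \theta = 0.3, \end{cases}$$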


which is the same result as obtained earlier in Exercise 1.4. 
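

The shortcut can be sketched in R by working only with the kernel $\theta^3$ and normalising at the end. This is an illustration and not the book's own code; the last step also recovers f(y) as the normalising constant, using $f(y) = f(\theta) f(y|\theta)/f(\theta|y)$, which must give the same value for every $\theta$.

```r
# Exercise 1.5 via the proportionality formula (illustrative sketch)
theta  <- c(0.1, 0.2, 0.3)
kernel <- theta^3                 # prior kernel (theta) times likelihood kernel (theta^2)
post   <- kernel / sum(kernel)    # normalise: 1/36, 8/36, 27/36
round(post, 5)

# Recover f(y) from the identified posterior; each element should equal 0.06
fy <- (10 * theta / 6) * theta^2 / post
fy
```

All constants that do not involve $\theta$, such as 10/6 and the binomial coefficient, were dropped before normalising, which is exactly what the proportionality sign permits.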


Exercise 1.6 Buses 


You are visiting a town with buses whose licence plates show their 
numbers consecutively from 1 up to however many there are. In your 
mind the number of buses could be anything from one to five, with all 
possibilities equally likely. 


Whilst touring the town you first happen to see Bus 3.


Assuming that at any point in time you are equally likely to see any of
the buses in the town, how likely is it that the town has at least four
buses?
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Solution to Exercise 1.6 


Let $\theta$ be the number of buses in the town and let $y$ be the number of the 
bus that you happen to first see. Then an appropriate Bayesian model is: 

$$f(y \mid \theta) = 1/\theta, \quad y = 1, \ldots, \theta$$

$$f(\theta) = 1/5, \quad \theta = 1, \ldots, 5 \quad \text{(prior)}.$$


Note: We could also write this model as: 

$$(y \mid \theta) \sim DU(1, \ldots, \theta)$$

$$\theta \sim DU(1, \ldots, 5),$$

where DU denotes the discrete uniform distribution. (See Appendix B.9 
for details regarding this distribution. Appendix B also provides details 
regarding some other important distributions that feature in this book.) 


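So the posterior density of $\theta$ is 

$$f(\theta \mid y) \propto f(\theta) f(y \mid \theta) \propto 1 \times 1/\theta, \quad \theta = y, \ldots, 5.$$


Noting that $y = 3$, we have that 

$$f(\theta \mid y) \propto \begin{cases} 1/3, & \theta = 3 \\ 1/4, & \theta = 4 \\ 1/5, & \theta = 5. \end{cases}$$


Now, $1/3 + 1/4 + 1/5 = (20 + 15 + 12)/60 = 47/60$, and so 

$$f(\theta \mid y) = \begin{cases} \dfrac{1/3}{47/60} = \dfrac{20}{47}, & \theta = 3 \\[2mm] \dfrac{1/4}{47/60} = \dfrac{15}{47}, & \theta = 4 \\[2mm] \dfrac{1/5}{47/60} = \dfrac{12}{47}, & \theta = 5. \end{cases}$$


So the posterior probability that the town has at least four buses is 

$$P(\theta \ge 4 \mid y) = \sum_{\theta:\,\theta \ge 4} f(\theta \mid y) = f(\theta = 4 \mid y) + f(\theta = 5 \mid y) = 1 - f(\theta = 3 \mid y) = 1 - \frac{20}{47} = \frac{27}{47} = 0.5745.$$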
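
As a check on the arithmetic, the posterior under the discrete uniform prior can be computed directly. This is a minimal R sketch; the variable names are illustrative and not part of the original solution.

```r
# Prior: theta ~ DU(1,...,5); data: first bus seen is number y = 3
theta <- 1:5
prior <- rep(1/5, 5)
y <- 3

# Likelihood f(y | theta) = 1/theta if theta >= y, and 0 otherwise
like <- ifelse(theta >= y, 1 / theta, 0)

# Normalise prior times likelihood to get the posterior
post <- prior * like / sum(prior * like)
post                                  # 0, 0, 20/47, 15/47, 12/47

# Posterior probability of at least four buses
p_at_least_4 <- sum(post[theta >= 4])
p_at_least_4                          # 27/47
```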

This exercise is a variant of the famous ‘tramcar problem’ considered by 
Harold Jeffreys in his book Theory of Probability and previously 
suggested to him by M.H.A. Newman (see Jeffreys, 1961, page 238). 
Suppose that before entering the town you had absolutely no idea about 
the number of buses $\theta$. Then, according to Jeffreys’ logic, a prior which 
may be considered as suitably uninformative (or noninformative) in this 
situation is given by $f(\theta) \propto 1/\theta$, $\theta = 1, 2, 3, \ldots$. 


Now, this prior density is problematic because it is improper (since 
$\sum_{\theta=1}^{\infty} 1/\theta = \infty$). However, it leads to a proper posterior density given by 

$$f(\theta \mid y) = \frac{1}{c\,\theta^{2}}, \quad \theta = 3, 4, 5, \ldots,$$

where 

$$c = \frac{1}{3^{2}} + \frac{1}{4^{2}} + \frac{1}{5^{2}} + \cdots = \frac{\pi^{2}}{6} - \frac{1}{1^{2}} - \frac{1}{2^{2}} = 0.394934.$$


So, under this alternative prior, the probability of there being at least 
four buses in the town (given that you have seen Bus 3) works out as 

$$P(\theta \ge 4 \mid y) = 1 - P(\theta = 3 \mid y) = 1 - \frac{1}{3^{2} c} = 0.7187.$$

The logic which Jeffreys used to come up with the prior $f(\theta) \propto 1/\theta$ in 
relation to the tramcar problem will be discussed further in Chapter 2. 


R Code for Exercise 1.6 


options(digits=6); c=(1/6)*(pi^2)-5/4; c # 0.394934 
1-(1/3^2)/c # 0.718659 


Exercise 1.7 Balls in a box 


In each of nine indistinguishable boxes there are nine balls, the $i$th box 
having $i$ red balls and $9 - i$ white balls ($i = 1, \ldots, 9$). 


One box is selected randomly from the nine, and then three balls are 
chosen randomly from the selected box (without replacement and 
without looking at the remaining balls in the box). 


Exactly two of the three chosen balls are red. Find the probability that 
the selected box has at least four red balls remaining in it. 


Solution to Exercise 1.7 


Let $N$ = the number of balls in each box (9) 
$n$ = the number of balls chosen from the selected box (3) 
$\theta$ = the number of red balls initially in the selected box 
(1, 2, ..., 8 or 9) 
$y$ = the number of red balls amongst the $n$ chosen balls (2). 


Then an appropriate Bayesian model is: 

$$(y \mid \theta) \sim Hyp(N, \theta, n)$$ (Hypergeometric with parameters $N$, $\theta$ and $n$, and having mean $n\theta/N$) 

$$\theta \sim DU(1, \ldots, N)$$ (discrete uniform over the integers $1, 2, \ldots, N$). 


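For this model, the posterior density of $\theta$ is 

$$f(\theta \mid y) \propto f(\theta) f(y \mid \theta) = \frac{1}{N} \times \binom{\theta}{y} \binom{N-\theta}{n-y} \bigg/ \binom{N}{n}$$

$$\propto \frac{\theta!\,(N-\theta)!}{(\theta-y)!\,(N-\theta-(n-y))!}, \quad \theta = y, \ldots, N-(n-y).$$


In our case, 

$$f(\theta \mid y) \propto \frac{\theta!\,(9-\theta)!}{(\theta-2)!\,(9-\theta-(3-2))!}, \quad \theta = 2, \ldots, 9-(3-2),$$

or more simply, 

$$f(\theta \mid y) \propto \theta(\theta-1)(9-\theta), \quad \theta = 2, \ldots, 8.$$


Thus 

$$f(\theta \mid y) \propto k(\theta) \equiv \begin{cases} 14, & \theta = 2 \\ 36, & \theta = 3 \\ 60, & \theta = 4 \\ 80, & \theta = 5 \\ 90, & \theta = 6 \\ 84, & \theta = 7 \\ 56, & \theta = 8, \end{cases}$$

where 

$$c \equiv \sum_{\theta=2}^{8} k(\theta) = 14 + 36 + \cdots + 56 = 420.$$
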
So 

$$f(\theta \mid y) = \frac{k(\theta)}{c} = \begin{cases} 14/420 = 0.03333, & \theta = 2 \\ 36/420 = 0.08571, & \theta = 3 \\ 60/420 = 0.14286, & \theta = 4 \\ 80/420 = 0.19048, & \theta = 5 \\ 90/420 = 0.21429, & \theta = 6 \\ 84/420 = 0.20000, & \theta = 7 \\ 56/420 = 0.13333, & \theta = 8. \end{cases}$$


The probability that the selected box has at least four red balls remaining 
is the posterior probability that $\theta$ (the number of red balls initially in the 
box) is at least 6 (since two red balls have already been taken out of the 
box). So the required probability is 

$$P(\theta \ge 6 \mid y) = \frac{90 + 84 + 56}{420} = \frac{23}{42} = 0.5476.$$


R Code for Exercise 1.7 


tv=2:8; kv=tv*(tv-1)*(9-tv); c=sum(kv); c # 420 
options(digits=4); cbind(tv,kv,kv/c,cumsum(kv/c)) 
# [1,] 2 14 0.03333 0.03333 
# [2,] 3 36 0.08571 0.11905 
# [3,] 4 60 0.14286 0.26190 
# [4,] 5 80 0.19048 0.45238 
# [5,] 6 90 0.21429 0.66667 
# [6,] 7 84 0.20000 0.86667 
# [7,] 8 56 0.13333 1.00000 

23/42 # 0.5476 
1-0.45238 # 0.5476 (alternative calculation of the required probability) 
sum((kv/c)[tv>=6]) # 0.5476 (yet another calculation of the required probability) 


1.7 Continuous parameters 


The examples above have all featured a target parameter which is 
discrete. The following example illustrates Bayesian inference involving 
a continuous parameter. This case presents no new problems, except that 
the prior and posterior densities of the parameter may no longer be 
interpreted directly as probabilities. 

Exercise 1.8 The binomial-beta model (or beta-binomial model) 


Consider the following Bayesian model: 

$$(y \mid \theta) \sim Binomial(n, \theta)$$

$$\theta \sim Beta(\alpha, \beta) \quad \text{(prior)}.$$


Find the posterior distribution of $\theta$. 


Solution to Exercise 1.8 


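The posterior density is 

$$f(\theta \mid y) \propto f(\theta) f(y \mid \theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)} \times \binom{n}{y} \theta^{y} (1-\theta)^{n-y}$$

$$\propto \theta^{\alpha-1}(1-\theta)^{\beta-1} \times \theta^{y}(1-\theta)^{n-y}$$ (ignoring constants which do not depend on $\theta$) 

$$= \theta^{(\alpha+y)-1} (1-\theta)^{(\beta+n-y)-1}, \quad 0 < \theta < 1.$$


This is the kernel of the beta density with parameters $\alpha + y$ and 
$\beta + n - y$. It follows that the posterior distribution of $\theta$ is given by 

$$(\theta \mid y) \sim Beta(\alpha + y, \beta + n - y),$$

and the posterior density of $\theta$ is (exactly) 

$$f(\theta \mid y) = \frac{\theta^{(\alpha+y)-1} (1-\theta)^{(\beta+n-y)-1}}{B(\alpha+y, \beta+n-y)}, \quad 0 < \theta < 1.$$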


For example, suppose that $\alpha = \beta = 1$, that is, $\theta \sim Beta(1, 1)$. 
Then the prior density is 

$$f(\theta) = \frac{\theta^{1-1}(1-\theta)^{1-1}}{B(1,1)} = 1, \quad 0 < \theta < 1.$$

Thus the prior may also be expressed by writing $\theta \sim U(0, 1)$. 


Also, suppose that $n = 2$. Then there are three possible values of $y$, 
namely 0, 1 and 2, and these lead to the following three posteriors, 
respectively: 

$$(\theta \mid y) \sim Beta(1+0, 1+2-0) = Beta(1, 3)$$

$$(\theta \mid y) \sim Beta(1+1, 1+2-1) = Beta(2, 2)$$

$$(\theta \mid y) \sim Beta(1+2, 1+2-2) = Beta(3, 1).$$


These three posteriors and the prior are illustrated in Figure 1.5. 
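
The conjugate result can also be verified numerically, without recognising the kernel. The sketch below is an illustration only (the names `alp`, `bet`, `kern` and `fy` are assumptions made here): it normalises prior times likelihood by numerical integration and compares the result with the Beta(α + y, β + n − y) density.

```r
# Example values: uniform prior (alpha = beta = 1), n = 2 trials, y = 1 success
alp <- 1; bet <- 1; n <- 2; y <- 1

# Unnormalised posterior: prior density times binomial likelihood
kern <- function(t) dbeta(t, alp, bet) * dbinom(y, size = n, prob = t)

# Normalising constant f(y), obtained by numerical integration over (0, 1)
fy <- integrate(kern, 0, 1)$value

# Compare the normalised kernel with the exact beta posterior on a grid
tv <- seq(0.05, 0.95, by = 0.05)
num_post <- kern(tv) / fy
exact_post <- dbeta(tv, alp + y, bet + n - y)
max(abs(num_post - exact_post))   # effectively zero
```

With these values the normalising constant is f(y = 1) = 1/3, which reflects the fact that under the uniform prior each of y = 0, 1, 2 is equally likely a priori.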




Note: The prior here may be considered uninformative because it is 
‘flat’ over the entire range of possible values for $\theta$, namely 0 to 1. This 
prior was originally used by Thomas Bayes and is often called the Bayes 
prior. However, other uninformative priors have been proposed for the 
binomial parameter $\theta$. These will be discussed later, in Chapter 2. 


Figure 1.5 The prior and three posteriors in Exercise 1.8 


R Code for Exercise 1.8 

X11(w=8,h=5); par(mfrow=c(1,1)); 
plot(c(0,1),c(0,3),type="n",xlab="theta",ylab="density") 
lines(c(0,1),c(1,1),lty=1,lwd=3); tv=seq(0,1,0.01) 
lines(tv,3*(1-tv)^2,lty=2,lwd=3) 
lines(tv,3*2*tv*(1-tv),lty=3,lwd=3) 
lines(tv,3*tv^2,lty=4,lwd=3) 
legend(0.3,3,c("prior","posterior if y=0","posterior if y=1","posterior if y=2"), 
lty=c(1,2,3,4),lwd=rep(2,4)) 


1.8 Finite and infinite population inference 


In the last example (Exercise 1.8), with the model: 

$$(y \mid \theta) \sim Binomial(n, \theta)$$

$$\theta \sim Beta(\alpha, \beta),$$

the quantity of interest $\theta$ is the probability of success on a single 
Bernoulli trial. 


This quantity may be thought of as the average of a hypothetically 
infinite number of Bernoulli trials. For that reason we may refer to 
derivation of the posterior distribution, 

$$(\theta \mid y) \sim Beta(\alpha + y, \beta + n - y),$$

as infinite population inference. 


In contrast, for the ‘buses’ example further above (Exercise 1.6), which 
involves the model: 

$$f(y \mid \theta) = 1/\theta, \quad y = 1, \ldots, \theta$$

$$f(\theta) = 1/5, \quad \theta = 1, \ldots, 5,$$

the quantity of interest $\theta$ represents the number of buses in a population 
of buses, which of course is finite. 


Therefore derivation of the posterior, 

$$f(\theta \mid y) = \begin{cases} 20/47, & \theta = 3 \\ 15/47, & \theta = 4 \\ 12/47, & \theta = 5, \end{cases}$$

may be termed finite population inference. 


Another example of finite population inference is the ‘balls in a box’ 
example (Exercise 1.7), where the model is: 

$$(y \mid \theta) \sim Hyp(N, \theta, n)$$

$$\theta \sim DU(1, \ldots, N),$$

and where the quantity of interest $\theta$ is the number of red balls initially 
in the selected box (1, 2, ..., 8 or 9). 


And another example of infinite population inference is the ‘loaded dice’ 
example (Exercises 1.4 and 1.5), where the model is: 

$$f(y \mid \theta) = \binom{2}{y} \theta^{y} (1-\theta)^{2-y}, \quad y = 0, 1, 2$$

$$f(\theta) = 10\theta/6, \quad \theta = 0.1, 0.2, 0.3,$$

and where the quantity of interest $\theta$ is the probability of 6 coming up on 
a single roll of the chosen die (i.e. the average number of 6s that come 
up on a hypothetically infinite number of rolls of that particular die). 


Generally, finite population inference may also be thought of in terms of 
prediction (e.g. in the ‘buses’ example, we are predicting the total 
number of buses in the town). For that reason, finite population 
inference may also be referred to as predictive inference. Yet another 
term for finite population inference is descriptive inference. In contrast, 
infinite population inference may also be called analytic inference. More 
will be said on finite population/predictive/descriptive inference in later 
chapters of the course. 


1.9 Continuous data 


So far, all the Bayesian models considered have featured data which is 
modelled using a discrete distribution. (Some of these models have a 
discrete parameter and some have a continuous parameter.) The 
following is an example with data that follows a continuous probability 
distribution. (This example also has a continuous parameter.) 


Exercise 1.9 The exponential-exponential model 


Suppose $\theta$ has the standard exponential distribution, and the conditional 
distribution of $y$ given $\theta$ is exponential with mean $1/\theta$. Find the 
posterior density of $\theta$ given $y$. 


Solution to Exercise 1.9 


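The Bayesian model here is: 

$$f(y \mid \theta) = \theta e^{-\theta y}, \quad y > 0$$

$$f(\theta) = e^{-\theta}, \quad \theta > 0.$$


So 

$$f(\theta \mid y) \propto f(\theta) f(y \mid \theta) \propto e^{-\theta} \times \theta e^{-\theta y} = \theta^{2-1} e^{-\theta(y+1)}, \quad \theta > 0.$$


This is the kernel of a gamma distribution with parameters 2 and $y + 1$, 
as per the definitions in Appendix B.2. Thus we may write 

$$(\theta \mid y) \sim Gamma(2, y+1),$$

from which it follows that the posterior density of $\theta$ is 

$$f(\theta \mid y) = \frac{(y+1)^{2}\, \theta^{2-1} e^{-\theta(y+1)}}{\Gamma(2)}, \quad \theta > 0.$$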
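
The identification of the gamma kernel can be confirmed numerically. In the R sketch below, the observed value `y = 2.5` is a hypothetical choice made purely for illustration, and the second parameter of the gamma distribution is taken to be a rate, consistent with the density just given.

```r
y <- 2.5   # hypothetical observed value (not from the text)

# Unnormalised posterior: f(theta) * f(y | theta) = exp(-theta) * theta * exp(-theta * y)
kern <- function(t) exp(-t) * t * exp(-t * y)

# Normalising constant f(y); analytically this equals 1/(y+1)^2
fy <- integrate(kern, 0, Inf)$value

# Compare the normalised kernel with the Gamma(2, y+1) density (shape 2, rate y+1)
tv <- seq(0.1, 3, by = 0.1)
max(abs(kern(tv) / fy - dgamma(tv, shape = 2, rate = y + 1)))  # effectively zero
```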


Exercise 1.10 The uniform-uniform model 


Consider the Bayesian model given by: 

$$(y \mid \theta) \sim U(0, \theta)$$

$$\theta \sim U(0, 1).$$


Find the posterior density of $\theta$ given $y$. 


Solution to Exercise 1.10 


Noting that $0 < y < \theta < 1$, we see that the posterior density is 

$$f(\theta \mid y) = \frac{f(\theta) f(y \mid \theta)}{f(y)} = \frac{1 \times (1/\theta)}{\int_{y}^{1} 1 \times (1/\theta)\, d\theta} = \frac{1/\theta}{\log 1 - \log y} = \frac{-1}{\theta \log y}, \quad y < \theta < 1.$$


Note: This is a ‘non-standard’ density and strictly decreasing. To give a 
physical example, a stick of length 1 metre is cut at a point randomly 
located along its length. The part to the right of the cut is discarded and 
then another cut is made randomly along the stick which remains. Then 
the part to the right of that second cut is likewise discarded. The length 
of the stick remaining after the first cut is a random variable with density 
as given above, with y being the length of the finally remaining stick. 
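
Since this posterior is non-standard, a quick numerical check is reassuring. The following R sketch uses a hypothetical observed value `y = 0.2` (an assumption for illustration) to confirm that the density integrates to 1 over (y, 1) and to compute its mean, which works out analytically as (1 − y)/(−log y).

```r
y <- 0.2   # hypothetical observed value (not from the text)

# Posterior density f(theta | y) = -1 / (theta * log(y)) on y < theta < 1
post <- function(t) -1 / (t * log(y))

# The density should integrate to 1 over (y, 1)
total <- integrate(post, y, 1)$value

# Posterior mean by numerical integration, and in closed form
post_mean <- integrate(function(t) t * post(t), y, 1)$value
closed_form <- (1 - y) / (-log(y))
c(total, post_mean, closed_form)
```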


1.10 Conjugacy 


When the prior and posterior distributions are members of the same class 
of distributions, we say that they form a conjugate pair, or that the prior 
is conjugate. For example, consider the binomial-beta model: 
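$$(y \mid \theta) \sim Binomial(n, \theta)$$
$$\theta \sim Beta(\alpha, \beta) \quad \text{(prior)}$$
$$\Rightarrow (\theta \mid y) \sim Beta(\alpha + y, \beta + n - y) \quad \text{(posterior)}.$$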
Since both prior and posterior are beta, the prior is conjugate. 


Likewise, consider the exponential-exponential model: 
$$f(y \mid \theta) = \theta e^{-\theta y}, \quad y > 0$$
$$f(\theta) = e^{-\theta}, \quad \theta > 0 \quad \text{(i.e. } \theta \sim Gamma(1, 1)\text{) (prior)}$$
$$\Rightarrow (\theta \mid y) \sim Gamma(2, y+1) \quad \text{(posterior)}.$$
Since both prior and posterior are gamma, the prior is conjugate. 


On the other hand, consider the model in the buses example: 
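$$(y \mid \theta) \sim DU(1, \ldots, \theta)$$
$$\theta \sim DU(1, \ldots, 5) \quad \text{(prior)}$$
$$\Rightarrow f(\theta \mid y = 3) = \begin{cases} 20/47, & \theta = 3 \\ 15/47, & \theta = 4 \\ 12/47, & \theta = 5 \end{cases} \quad \text{(posterior)}.$$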


The prior is discrete uniform but the posterior is not. So in this case the 
prior is not conjugate. 


Specifying a Bayesian model using a conjugate prior is generally 
desirable because it can simplify the calculations required. 


1.11 Bayesian point estimation 


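Once the posterior distribution or density $f(\theta \mid y)$ has been obtained, 
Bayesian point estimates of the model parameter $\theta$ can be calculated. 
The three most commonly used point estimates are as follows. 


* The posterior mean of $\theta$ is 

$$E(\theta \mid y) = \int \theta\, dF(\theta \mid y) = \begin{cases} \int \theta f(\theta \mid y)\, d\theta & \text{if } \theta \text{ is continuous} \\ \sum_{\theta} \theta f(\theta \mid y) & \text{if } \theta \text{ is discrete.} \end{cases}$$


* The posterior mode of $\theta$ is 
Mode($\theta \mid y$) = any value $m \in \Re$ which satisfies 

$$f(\theta = m \mid y) = \max_{\theta} f(\theta \mid y)$$
or 
$$\lim_{\theta \to m} f(\theta \mid y) = \sup_{\theta} f(\theta \mid y),$$
or the set of all such values. 

* The posterior median of $\theta$ is 
Median($\theta \mid y$) = any value $m$ of $\theta$ such that 
$$P(\theta \le m \mid y) \ge 1/2$$
and 
$$P(\theta \ge m \mid y) \ge 1/2,$$
or the set of all such values. 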


Note 1: In some cases, the posterior mean does not exist or it is equal to 
infinity or minus infinity. 

Note 2: Typically, the posterior mode and posterior median are unique. 
The above definitions are given for completeness. 


Note 3: The integral $\int \theta\, dF(\theta \mid y)$ is a Lebesgue-Stieltjes integral. This 
may need to be evaluated as the sum of two separate parts in the case 
where $\theta$ has a mixed distribution. In the continuous case, it is useful to 
think of $dF(\theta \mid y)$ as $\dfrac{dF(\theta \mid y)}{d\theta}\, d\theta = f(\theta \mid y)\, d\theta$. 


Note 4: The above three Bayesian point estimates may be interpreted in 
an intuitive manner. For example, $\theta$'s posterior mode is the value of $\theta$ 
which is ‘made most likely by the data’. They may also be understood in 
the context of Bayesian decision theory (discussed later). 
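
To make the three definitions concrete, the following R sketch computes them for the discrete posterior found in Exercise 1.7 (Balls in a box). The variable names are illustrative, and the median is found as the smallest value whose cumulative posterior probability reaches 1/2, which satisfies both inequalities in the definition above.

```r
# Posterior from Exercise 1.7: f(theta | y) = theta*(theta-1)*(9-theta)/420, theta = 2,...,8
theta <- 2:8
post <- theta * (theta - 1) * (9 - theta) / 420

# Posterior mean: sum of theta * f(theta | y)
post_mean <- sum(theta * post)

# Posterior mode: value of theta with the largest posterior probability
post_mode <- theta[which.max(post)]

# Posterior median: smallest m with P(theta <= m | y) >= 1/2
cdf <- cumsum(post)
post_median <- theta[min(which(cdf >= 0.5))]

c(post_mean, post_mode, post_median)   # 5.6 6 6
```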


1.12 Bayesian interval estimation 


There are many ways to construct a Bayesian interval estimate, but the 
two most common ways are defined as follows. The $1-\alpha$ (or 
$100(1-\alpha)\%$) highest posterior density region (HPDR) for $\theta$ is the 
smallest set $S$ such that: 

$$P(\theta \in S \mid y) \ge 1 - \alpha$$

and 

$$f(\theta_1 \mid y) \ge f(\theta_2 \mid y) \text{ if } \theta_1 \in S \text{ and } \theta_2 \notin S.$$


Figure 1.6 illustrates the idea of the HPDR. In the very common 
situation where $\theta$ is scalar, continuous and has a posterior density which 
is unimodal with no local modes (i.e. has the form of a single ‘mound’), 
the $1-\alpha$ HPDR takes on the form of a single interval defined by two 
points at which the posterior density has the same value. When the 
HPDR is a single interval, it is the shortest possible single interval over 
which the area under the posterior density is $1-\alpha$. 


The $1-\alpha$ central posterior density region (CPDR) for a scalar parameter 
$\theta$ may be defined as the shortest single interval $[a, b]$ such that: 

$$P(\theta < a \mid y) \le \alpha/2$$

and 

$$P(\theta > b \mid y) \le \alpha/2.$$

Figure 1.6 An 80% HPDR 

[Figure: the posterior density $f(\theta \mid y)$ plotted against $\theta$. The 80% HPDR consists of the values of $\theta$ with the highest posterior density, chosen such that the area underneath equals 0.8.] 


Figure 1.7 illustrates the idea of the CPDR. One drawback of the CPDR 
is that it is only defined for a scalar parameter. Another drawback is that 
some values inside the CPDR may be less likely a posteriori than some 
values outside it (which is not the case with the HPDR). For example, in 
Figure 1.7, a value just below the upper bound of the 80% CPDR has a 
smaller posterior density than a value just below the lower bound of that 
CPDR. However, CPDRs are typically easier to calculate than HPDRs. 


In the common case of a continuous parameter with a posterior density 
in the form of a single ‘mound’ which is furthermore symmetric, the 
CPDR and HPDR are identical. 
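
For an asymmetric unimodal posterior the two regions differ, and both can be computed numerically. The R sketch below assumes a hypothetical Beta(3, 2) posterior (not an example from the text). The CPDR is read off from quantiles, and the HPDR is found by searching over all intervals with posterior probability 0.8 for the shortest one, which is one way (among others) to exploit the ‘shortest interval’ property noted above.

```r
# Hypothetical posterior: (theta | y) ~ Beta(3, 2), which is unimodal and skewed
a1 <- 3; b1 <- 2
level <- 0.8            # 1 - alpha
alpha_lvl <- 1 - level

# 80% CPDR: cut off probability alpha/2 in each tail
cpdr <- qbeta(c(alpha_lvl / 2, 1 - alpha_lvl / 2), a1, b1)

# 80% HPDR: among intervals (q(p), q(p + level)) with 0 <= p <= alpha, take the shortest
width <- function(p) qbeta(p + level, a1, b1) - qbeta(p, a1, b1)
p_opt <- optimize(width, interval = c(0, alpha_lvl), tol = 1e-10)$minimum
hpdr <- qbeta(c(p_opt, p_opt + level), a1, b1)

rbind(cpdr, hpdr)
dbeta(hpdr, a1, b1)     # the density is (nearly) equal at the two HPDR end points
```

The final line illustrates the property that the HPDR end points are two points at which the posterior density takes the same value.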


Note 1: The $1-\alpha$ CPDR for $\theta$ may alternatively be defined as the 
shortest single open interval $(a, b)$ such that: 
$$P(\theta < a \mid y) \le \alpha/2$$
and 
$$P(\theta > b \mid y) \le \alpha/2.$$


Other variations are possible (of the form $[a, b)$ and $(a, b]$); but when the 
parameter of interest $\theta$ is continuous these definitions are all equivalent. 
Yet another definition of the $1-\alpha$ CPDR is any of the CPDRs as defined 
above but with all a posteriori impossible values of $\theta$ excluded. 


Note 2: As regards terminology, whenever the HPDR is a single 
interval, it may also be called the highest posterior density interval 
(HPDI). Likewise, the CPDR, which is always a single interval, may 
also be called the central posterior density interval (CPDI). 

Figure 1.7 An 80% CPDR 

[Figure: the posterior density $f(\theta \mid y)$ plotted against $\theta$, with the 80% CPDR marked on the horizontal axis.] 


Exercise 1.11 A bent coin 


We have a bent coin, for which $\theta$, the probability of heads coming up, is 
unknown. Our prior beliefs regarding $\theta$ may be described by a standard 
uniform distribution. Thus no value of $\theta$ is deemed more or less likely 
than any other. 


We toss the coin $n = 5$ times (independently), and heads come up every 
time. 


Find the posterior mean, mode and median of $\theta$. Also find the 80% 
HPDR and CPDR for $\theta$. 


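Solution to Exercise 1.11 


Recall the binomial-beta model: 
$$(y \mid \theta) \sim Binomial(n, \theta)$$
$$\theta \sim Beta(\alpha, \beta),$$
for which $(\theta \mid y) \sim Beta(\alpha + y, \beta + n - y)$. 


We now apply this result with $n = y = 5$ and $\alpha = \beta = 1$ (corresponding 
to $\theta \sim U(0, 1)$), and find that: 

$$(\theta \mid y) \sim Beta(1+5, 5-5+1) = Beta(6, 1)$$

$$f(\theta \mid y) = \frac{\theta^{6-1}(1-\theta)^{1-1}}{B(6, 1)} = 6\theta^{5}, \quad 0 < \theta < 1$$

$$F(\theta \mid y) = \int_{0}^{\theta} 6t^{5}\, dt = \theta^{6}, \quad 0 < \theta < 1.$$
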
Therefore: 

$$E(\theta \mid y) = \frac{6}{6+1} = \frac{6}{7} = 0.8571$$

$$\text{Mode}(\theta \mid y) = \frac{6-1}{(6-1)+(1-1)} = 1$$

$$\text{Median}(\theta \mid y) = \text{solution in } \theta \text{ of } F(\theta \mid y) = 1/2, \text{ i.e. } \theta^{6} = 0.5, \quad = (0.5)^{1/6} = 0.8909.$$


Also, the 80% HPDR is $(0.2^{1/6}, 1) = (0.7647, 1)$ (since $f(\theta \mid y)$ is strictly 
increasing), and the 80% CPDR is $(0.1^{1/6}, 0.9^{1/6}) = (0.6813, 0.9826)$. The 
three point estimates and two interval estimates just derived are shown in 
Figure 1.8. 


Figure 1.8 Inference in Exercise 1.11 

[Figure: the posterior density plotted against theta, with the posterior mean, posterior mode and posterior median marked on the horizontal axis, and the 80% CPDR and 80% HPDR drawn as horizontal line segments.] 


R Code for Exercise 1.11 


options(digits=4); postmean=6/7; postmode=1; postmedian=0.5^(1/6) 
c(postmean,postmode,postmedian) # 0.8571 1.0000 0.8909 
hpdr=c(0.2^(1/6),1); cpdr=c(0.1,0.9)^(1/6) 
c(hpdr,cpdr) # 0.7647 1.0000 0.6813 0.9826 


X11(w=8,h=5); par(mfrow=c(1,1)); tv=seq(0,1,0.01); fv=dbeta(tv,6,1) 
plot(tv,fv,type="l",lwd=3,xlab="theta",ylab="posterior density") 
points(c(postmean,postmode,postmedian),c(0,0,0),pch=c(1,2,4)) 
points(hpdr,rep(0.2,2),pch=16); lines(hpdr,rep(0.2,2),lty=3,lwd=2) 
points(cpdr,rep(0.4,2),pch=16); lines(cpdr,rep(0.4,2),lty=2,lwd=2) 
abline(v=c(postmean,postmode,postmedian),lty=3) 
abline(v=c(0,hpdr,cpdr),lty=3); abline(h=c(0,6),lty=3) 
legend(0.2,5.8,c("posterior mean","posterior mode", 
"posterior median"),pch=c(1,2,4)) 
legend(0.2,2.8,c("80% CPDR","80% HPDR"),lty=c(2,3),lwd=c(2,2)) 


Exercise 1.12 HPDR and CPDR for a discrete parameter 


Consider the posterior distribution from Exercise 1.7 (Balls in a box): 

$$f(\theta \mid y) = \begin{cases} 14/420 = 0.03333, & \theta = 2 \\ 36/420 = 0.08571, & \theta = 3 \\ 60/420 = 0.14286, & \theta = 4 \\ 80/420 = 0.19048, & \theta = 5 \\ 90/420 = 0.21429, & \theta = 6 \\ 84/420 = 0.20000, & \theta = 7 \\ 56/420 = 0.13333, & \theta = 8. \end{cases}$$


Find the 90% HPDR and 90% CPDR for $\theta$. Also find the 50% HPDR 
and 50% CPDR for $\theta$. For each region, calculate the associated exact 
coverage probability. 


Solution to Exercise 1.12 


The 90% HPDR is the set {3,4,5,6,7,8}; 
this has exact coverage 1 − 14/420 = 0.9667. 


The 90% CPDR is the closed interval [3, 8]; 
this likewise has exact coverage 0.9667. 


The 50% HPDR is {5,6,7}; 
this has exact coverage (80 + 90 + 84)/420 = 0.6048. 


The 50% CPDR is [4, 7]; 
this has exact coverage (60 + 80 + 90 + 84)/420 = 0.7476. 


Note: The lower bound of the 50% CPDR cannot be equal to 5. 
This is because $P(\theta < 5 \mid y) = (14 + 36 + 60)/420 = 0.2619$, which is not 
less than or equal to $\alpha/2 = 0.25$, as required by the definition of CPDR. 
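
The regions above can be reproduced mechanically from the definitions. The two helper functions in this R sketch, `hpdr_discrete` and `cpdr_discrete`, are hypothetical names introduced here for illustration. The HPDR helper adds values in decreasing order of posterior probability until the required level is reached (so it assumes no awkward ties among the probabilities), and the CPDR helper takes the largest lower bound and smallest upper bound satisfying the two tail conditions.

```r
theta <- 2:8
post <- c(14, 36, 60, 80, 90, 84, 56) / 420

# HPDR: accumulate values of theta from most to least probable until coverage >= level
hpdr_discrete <- function(theta, post, level) {
  ord <- order(post, decreasing = TRUE)
  k <- min(which(cumsum(post[ord]) >= level - 1e-12))
  sort(theta[ord[1:k]])
}

# CPDR: shortest [a, b] with P(theta < a | y) <= alpha/2 and P(theta > b | y) <= alpha/2
cpdr_discrete <- function(theta, post, level) {
  half <- (1 - level) / 2
  below <- cumsum(post) - post             # P(theta < a | y) for each candidate a
  above <- rev(cumsum(rev(post))) - post   # P(theta > b | y) for each candidate b
  c(max(theta[below <= half + 1e-12]), min(theta[above <= half + 1e-12]))
}

h50 <- hpdr_discrete(theta, post, 0.5); c50 <- cpdr_discrete(theta, post, 0.5)
h90 <- hpdr_discrete(theta, post, 0.9); c90 <- cpdr_discrete(theta, post, 0.9)

# Exact coverage probabilities of the 50% regions
cov_h50 <- sum(post[theta %in% h50])
cov_c50 <- sum(post[theta >= c50[1] & theta <= c50[2]])
round(c(cov_h50, cov_c50), 4)   # 0.6048 0.7476
```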


Exercise 1.13 Illustration of the definition of HPDR 


Suppose that the posterior probabilities of a parameter $\theta$ given data $y$ 
are exactly 10%, 40% and 50% for values 1, 2 and 3, respectively. Find 
$S$, the 40% HPDR for $\theta$. 


Solution to Exercise 1.13 


The smallest set $S$ such that $P(\theta \in S \mid y) \ge 0.4$ is {2} or {3}. With the 
additional requirement that $f(\theta_1 \mid y) \ge f(\theta_2 \mid y)$ if $\theta_1 \in S$ and $\theta_2 \notin S$, we 
see that $S$ = {3} (only). That is, the 40% HPDR is the singleton set {3}. 


1.13 Inference on functions of the model 
parameter 


So far we have examined Bayesian models with a single parameter $\theta$ 
and described how to perform posterior inference on that parameter. 
Sometimes there may also be interest in some function of the model 
parameter, denoted by (say) 

$$\psi = g(\theta).$$


Then the posterior density of $\psi$ can be derived using distribution theory, 
for example by applying the transformation rule, 

$$f(\psi \mid y) = f(\theta \mid y) \left| \frac{d\theta}{d\psi} \right|,$$

in cases where $\psi = g(\theta)$ is strictly increasing or strictly decreasing. 


Point and interval estimates of $\psi$ can then be calculated in the usual 
way, using $f(\psi \mid y)$. For example, the posterior mean of $\psi$ equals 

$$E(\psi \mid y) = \int \psi f(\psi \mid y)\, d\psi.$$


Sometimes it is more practical to calculate point and interval estimates 
another way, without first deriving $f(\psi \mid y)$. 


For example, another expression for the posterior mean is 

$$E(\psi \mid y) = E(g(\theta) \mid y) = \int g(\theta) f(\theta \mid y)\, d\theta.$$




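Also, the posterior median of $\psi$, call this $M$, can typically be obtained 
by simply calculating 

$$M = g(m),$$

where $m$ is the posterior median of $\theta$. 


Note: To see why this works (taking $g$ to be strictly increasing and $\theta$ continuous), we write 

$$P(\psi \le M \mid y) = P(g(\theta) \le M \mid y) = P(g(\theta) \le g(m) \mid y) = P(\theta \le m \mid y) = 1/2.$$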


Exercise 1.14 Estimation of an exponential mean 


Suppose that $\theta$ has the standard exponential distribution, and $y$ given $\theta$ 
is exponential with mean $1/\theta$. Find the posterior density and posterior 
mean of the model mean, $\psi = E(y \mid \theta) = 1/\theta$, given the data $y$. 


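Solution to Exercise 1.14 


Recall that the Bayesian model 

$$f(y \mid \theta) = \theta e^{-\theta y}, \quad y > 0$$

$$f(\theta) = e^{-\theta}, \quad \theta > 0$$

implies the posterior $(\theta \mid y) \sim Gamma(2, y+1)$. 


So, by definition, $(\psi \mid y) \sim InverseGamma(2, y+1)$, 
with density 

$$f(\psi \mid y) = \frac{(y+1)^{2}\, \psi^{-(2+1)} e^{-(y+1)/\psi}}{\Gamma(2)}, \quad \psi > 0,$$

and mean 

$$E(\psi \mid y) = \frac{y+1}{2-1} = y + 1.$$


Note: This mean could also be obtained as follows: 

$$E(\psi \mid y) = E\left( \frac{1}{\theta} \,\Big|\, y \right) = \int_{0}^{\infty} \frac{1}{\theta} f(\theta \mid y)\, d\theta = \int_{0}^{\infty} \frac{1}{\theta} \cdot \frac{(y+1)^{2}\, \theta^{2-1} e^{-\theta(y+1)}}{\Gamma(2)}\, d\theta$$

$$= \frac{(y+1)^{2}\, \Gamma(1)}{(y+1)^{1}\, \Gamma(2)} \int_{0}^{\infty} \frac{(y+1)^{1}\, \theta^{1-1} e^{-\theta(y+1)}}{\Gamma(1)}\, d\theta$$

$$= y + 1$$ (using the fact that the last integral equals 1). 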
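
Both routes to the posterior mean of ψ = 1/θ can be checked by numerical integration against the gamma posterior, and the posterior median of ψ follows from the rule M = g(m). In this R sketch the observed value `y = 2.5` is hypothetical; because g(θ) = 1/θ is strictly decreasing and the posterior is continuous, the median of ψ is still the reciprocal of the median of θ.

```r
y <- 2.5   # hypothetical observed value (not from the text)

# Posterior mean of psi = 1/theta as the integral of g(theta) * f(theta | y)
post_mean_psi <- integrate(
  function(t) (1 / t) * dgamma(t, shape = 2, rate = y + 1), 0, Inf)$value
post_mean_psi                 # close to y + 1 = 3.5

# Posterior median of psi via M = g(m), with m the posterior median of theta
m <- qgamma(0.5, shape = 2, rate = y + 1)
M <- 1 / m

# Check: P(psi <= M | y) = P(theta >= m | y) should equal 1/2
1 - pgamma(1 / M, shape = 2, rate = y + 1)
```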


Exercise 1.15 Inference on a function of the binomial 
parameter 


Recall the binomial-beta model given by: 
$$(y \mid \theta) \sim Binomial(n, \theta)$$
$$\theta \sim Beta(\alpha, \beta),$$
for which $(\theta \mid y) \sim Beta(\alpha + y, \beta + n - y)$. 


Find the posterior mean, density function and distribution function of 
$\psi = \theta^{2}$ in the case where $n = 5$, $y = 5$, and $\alpha = \beta = 1$. 


Note: In the context where we toss a bent coin five times and get heads 
every time (and the prior on the probability of heads is standard 
uniform), the quantity $\psi$ may be interpreted as the probability of the 
next two tosses both coming up heads, or equivalently, as the proportion 
of times heads will come up twice if the coin is repeatedly tossed in 
groups of two tosses a hypothetically infinite number of times. 


Solution to Exercise 1.15 


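Here, $(\theta \mid y) \sim Beta(1+5, 1+5-5) \sim Beta(6, 1)$ 
with pdf $f(\theta \mid y) = 6\theta^{5}$, $0 < \theta < 1$. 


Now $\theta = \psi^{1/2}$ and so, by the transformation method, the posterior 
density function of $\psi$ is 

$$f(\psi \mid y) = f(\theta \mid y) \left| \frac{d\theta}{d\psi} \right| = 6\psi^{5/2} \times \frac{1}{2} \psi^{-1/2} = 3\psi^{2}, \quad 0 < \psi < 1.$$


It follows that the posterior mean of $\psi$ is 

$$\hat{\psi} = E(\psi \mid y) = \int_{0}^{1} \psi (3\psi^{2})\, d\psi = 0.75,$$

and the posterior distribution function of $\psi$ is 

$$F(\psi = v \mid y) = \int_{0}^{v} f(\psi = t \mid y)\, dt = \int_{0}^{v} 3t^{2}\, dt = v^{3}, \quad 0 < v < 1.$$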




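Note 1: The posterior mean of $\psi = \theta^{2}$ can also be obtained by writing 

$$\hat{\psi} = E(\theta^{2} \mid y) = \int_{0}^{1} \theta^{2} (6\theta^{5})\, d\theta = 0.75$$

or 

$$\hat{\psi} = E(\theta^{2} \mid y) = V(\theta \mid y) + \{E(\theta \mid y)\}^{2} = \frac{6 \times 1}{(6+1)^{2}(6+1+1)} + \left( \frac{6}{6+1} \right)^{2} = 0.75$$

or 

$$(\psi \mid y) \sim Beta(3, 1) \;\Rightarrow\; \hat{\psi} = E(\psi \mid y) = 3/(3+1) = 0.75.$$


Note 2: The distribution function of $\psi = \theta^{2}$ can also be obtained by 
writing 

$$F(\psi = v \mid y) = P(\psi \le v \mid y) = P(\theta^{2} \le v \mid y) = P(\theta \le v^{1/2} \mid y) = F(\theta = v^{1/2} \mid y) = \left[ \theta^{6} \right]_{\theta = v^{1/2}} = v^{3}, \quad 0 < v < 1.$$


Note 3: In the above, $f(\psi = t \mid y)$ denotes the pdf of $\psi$ given $y$, but 
evaluated at $t$. This pdf could also be written as $f_{\psi}(t \mid y)$ or as 
$\left[ f(\psi \mid y) \right]_{\psi = t}$. Likewise, $F(\psi = v \mid y) = F_{\psi}(v \mid y) = \left[ F(\psi \mid y) \right]_{\psi = v}$. 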
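
These results are easy to confirm in R using the beta distribution functions. The sketch below is illustrative (variable names are assumptions): it computes the posterior mean of ψ = θ² by integrating against the Beta(6, 1) posterior, and evaluates the distribution function of ψ through that of θ, as in Note 2.

```r
# Posterior mean of psi = theta^2 under (theta | y) ~ Beta(6, 1)
psi_mean <- integrate(function(t) t^2 * dbeta(t, 6, 1), 0, 1)$value
psi_mean                        # 0.75

# Distribution function of psi: P(psi <= v | y) = P(theta <= sqrt(v) | y)
v <- c(0.1, 0.5, 0.9)
F_psi <- pbeta(sqrt(v), 6, 1)
rbind(F_psi, v^3)               # the two rows agree

# Equivalent check using (psi | y) ~ Beta(3, 1)
pbeta(v, 3, 1)
```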


1.14 Credibility estimates 


In actuarial studies, a credibility estimate is one which can be expressed 
as a weighted average of the form 
$$C = (1-k)A + kB,$$
where: 
A is the subjective estimate (or the collateral data estimate) 
B is the objective estimate (or the direct data estimate) 
k is the credibility factor, a number that is between 0 and 1 
(inclusive) and represents the weight assigned to the 
objective estimate. 


A high value of $k$ implies $C \approx B$, representing a situation where the 
objective estimate is assigned ‘high credibility’. A primary aim of 
credibility theory is to determine an appropriate value or formula for k, 
as is done, for example, in the theory of the Bühlmann model 
(Bühlmann, 1967). Many Bayesian models lead to a point estimate 
which can be expressed as an intuitively appealing credibility estimate. 


Exercise 1.16 Credibility estimation in the binomial-beta 
model 


Consider the binomial-beta model: 
$$(y \mid \theta) \sim Binomial(n, \theta)$$
$$\theta \sim Beta(\alpha, \beta).$$


Express the posterior mean of $\theta$ as a credibility estimate and discuss. 


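Solution to Exercise 1.16 


Earlier we showed that 

$$(\theta \mid y) \sim Beta(\alpha + y, \beta + n - y),$$

and hence that the posterior mean of $\theta$ is 

$$\hat{\theta} = E(\theta \mid y) = \frac{\alpha + y}{(\alpha + y) + (\beta + n - y)} = \frac{\alpha + y}{\alpha + \beta + n}.$$


Observe that the prior mean of $\theta$ is $E\theta = \alpha/(\alpha + \beta)$, and the maximum 
likelihood estimate (MLE) of $\theta$ is $y/n$. This suggests that we write 

$$\hat{\theta} = \frac{\alpha}{\alpha + \beta + n} + \frac{y}{\alpha + \beta + n}$$

$$= \frac{\alpha + \beta}{\alpha + \beta + n} \left( \frac{\alpha}{\alpha + \beta} \right) + \frac{n}{\alpha + \beta + n} \left( \frac{y}{n} \right)$$

$$= \left( 1 - \frac{n}{\alpha + \beta + n} \right) \left( \frac{\alpha}{\alpha + \beta} \right) + \frac{n}{\alpha + \beta + n} \left( \frac{y}{n} \right).$$


Thus 

$$\hat{\theta} = (1-k)A + kB,$$

where: 

$$A = \frac{\alpha}{\alpha + \beta}, \quad B = \frac{y}{n}, \quad k = \frac{n}{\alpha + \beta + n}.$$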


We see that the posterior mean $\hat{\theta}$ is a credibility estimate in the form of 
a weighted average of the prior mean $A = E\theta = \alpha/(\alpha + \beta)$ and the MLE 
$B = y/n$, where the weight assigned to the MLE is the credibility factor 
given by $k = n/(n + \alpha + \beta)$. Observe that as $n$ increases, the credibility 
factor $k$ approaches 1. This makes sense: if there is a lot of data then the 
prior should not have much influence on the estimation. 


Figure 1.9 illustrates this idea by showing relevant densities, likelihoods 
and estimates for the following two cases, respectively: 
(a) $n = 5$, $y = 4$, $\alpha = 2$, $\beta = 6$ 
(b) $n = 20$, $y = 16$, $\alpha = 2$, $\beta = 6$. 


In both cases, the prior mean is the same (A = 2/(2 + 6) = 0.25), as is the 
MLE (B = 4/5 = 16/20 = 0.8). However, due to n being larger in case (b) 
(i.e. there being more direct data), case (b) leads to a larger credibility 
factor (0.714 compared to 0.385) and hence a posterior mean closer to 
the MLE (0.643 compared to 0.462). 


Note: Each likelihood function in Figure 1.9 has been normalised so that 
the area underneath it is exactly 1. This means that in each case (a) and 
(b), the likelihood function $L(\theta)$ as shown is identical to the posterior 
density which would be implied by the standard uniform prior, i.e. under 
$f(\theta) = f_{U(0,1)}(\theta) = 1$, $0 < \theta < 1$. Thus, $L(\theta) = f_{Beta(y+1,\, n-y+1)}(\theta)$. 


Figure 1.9 Illustration for Exercise 1.16 
Legend: solid line = prior, dashed line = likelihood, dotted line = posterior, 
circle = prior mean, triangle = MLE, cross = posterior mean 


R Code for Exercise 1.16 

X11(w=8,h=7); par(mfrow=c(2,1)) 


alp=2; bet=6; n = 5; y = 4; pvec=seq(0,1,0.01) 
plot(c(0,1),c(0,3),type="n",xlab="theta",ylab="density/likelihood") 
lines(pvec,dbeta(pvec,alp,bet),lty=1,lwd=2) 
lines(pvec,dbeta(pvec,1+y,n-y+1),lty=2,lwd=2) 
lines(pvec,dbeta(pvec,alp+y,n-y+bet),lty=3,lwd=2) 
points(c(alp/(alp+bet), y/n,(alp+y)/(alp+bet+n)),c(0,0,0),pch=c(1,2,3), 
cex=rep(1.5,3),lwd=2); text(0,2.5,"(a)",cex=1.5) 
c(alp/(alp+bet), y/n,(alp+y)/(alp+bet+n)) # 0.2500000 0.8000000 0.4615385 
n/(alp+bet+n) # 0.3846154 


alp=2; bet=6; n = 20; y = 16; pvec=seq(0,1,0.01) 
plot(c(0,1),c(0,5),type="n",xlab="theta",ylab="density/likelihood") 
lines(pvec,dbeta(pvec,alp,bet),lty=1,lwd=2) 
lines(pvec,dbeta(pvec,1+y,n-y+1),lty=2,lwd=2) 
lines(pvec,dbeta(pvec,alp+y,n-y+bet),lty=3,lwd=2) 
points(c(alp/(alp+bet), y/n,(alp+y)/(alp+bet+n)),c(0,0,0),pch=c(1,2,3), 
cex=rep(1.5,3),lwd=2); text(0,4.5,"(b)",cex=1.5) 
c(alp/(alp+bet), y/n,(alp+y)/(alp+bet+n)) # 0.2500000 0.8000000 0.6428571 
n/(alp+bet+n) # 0.7142857 


Exercise 1.17 Further credibility estimation in the binomial- 
beta model 


Consider the binomial-beta model: 
$$(y \mid \theta) \sim Binomial(n, \theta)$$
$$\theta \sim Beta(\alpha, \beta).$$


If possible, express the posterior mode of $\theta$ as a credibility estimate. 


Solution to Exercise 1.17 


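Since $(\theta \mid y) \sim Beta(\alpha + y, \beta + n - y)$, the posterior mode of $\theta$ is 

$$\text{Mode}(\theta \mid y) = \frac{(\alpha + y) - 1}{(\alpha + y - 1) + (\beta + n - y - 1)} = \frac{\alpha + y - 1}{\alpha + \beta + n - 2}.$$


Now, the prior mode of $\theta$ is 

$$\text{Mode}(\theta) = \frac{\alpha - 1}{(\alpha - 1) + (\beta - 1)} = \frac{\alpha - 1}{\alpha + \beta - 2}.$$


So we write 

$$\text{Mode}(\theta \mid y) = \frac{\alpha - 1}{\alpha + \beta + n - 2} + \frac{y}{\alpha + \beta + n - 2}$$

$$= \frac{\alpha + \beta - 2}{\alpha + \beta + n - 2} \left( \frac{\alpha - 1}{\alpha + \beta - 2} \right) + \frac{n}{\alpha + \beta + n - 2} \left( \frac{y}{n} \right)$$

$$= \left( 1 - \frac{n}{\alpha + \beta + n - 2} \right) \left( \frac{\alpha - 1}{\alpha + \beta - 2} \right) + \frac{n}{\alpha + \beta + n - 2} \left( \frac{y}{n} \right).$$


We see that the posterior mode is a credibility estimate of the form 

$$\text{Mode}(\theta \mid y) = (1 - c)\, \text{Mode}(\theta) + c\, \hat{\theta},$$

where: 

$\text{Mode}(\theta) = \dfrac{\alpha - 1}{\alpha + \beta - 2}$ is the prior mode 

$\hat{\theta} = \dfrac{y}{n}$ is the maximum likelihood estimate 
(mode of the likelihood function) 

$c = \dfrac{n}{n + \alpha + \beta - 2}$ is the credibility factor 
(assigned to the direct data estimate, $\hat{\theta}$). 

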
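
The decomposition of the posterior mode can be checked numerically. This R sketch reuses the values from case (b) of Exercise 1.16 (α = 2, β = 6, n = 20, y = 16); the variable names are illustrative, and the formulae for the modes assume that the relevant beta parameters exceed 1.

```r
alp <- 2; bet <- 6; n <- 20; y <- 16   # case (b) of Exercise 1.16

prior_mode <- (alp - 1) / (alp + bet - 2)         # mode of Beta(alp, bet)
mle <- y / n                                      # mode of the likelihood
cred <- n / (n + alp + bet - 2)                   # credibility factor c

# Posterior mode, directly and as a credibility estimate
post_mode <- (alp + y - 1) / (alp + bet + n - 2)
post_mode_cred <- (1 - cred) * prior_mode + cred * mle
c(post_mode, post_mode_cred)                      # both equal 17/26

# Independent check: maximise the posterior density numerically
num_mode <- optimize(function(t) dbeta(t, alp + y, bet + n - y),
                     interval = c(0, 1), maximum = TRUE)$maximum
num_mode
```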
Exercise 1.18 The normal-normal model 


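Consider the following Bayesian model: 
$$(y_1, \ldots, y_n \mid \mu) \sim \text{iid } N(\mu, \sigma^{2})$$
$$\mu \sim N(\mu_0, \sigma_0^{2}),$$
where $\sigma^{2}$, $\mu_0$ and $\sigma_0^{2}$ are known or specified constants. 


Find the posterior distribution of $\mu$ given data in the form of the vector 
$y = (y_1, \ldots, y_n)$. 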

Solution to Exercise 1.18 


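The posterior density of $\mu$ is 

$$f(\mu \mid y) \propto f(\mu) f(y \mid \mu)$$

$$\propto \exp\left\{ -\frac{1}{2} \left( \frac{\mu - \mu_0}{\sigma_0} \right)^{2} \right\} \times \prod_{i=1}^{n} \exp\left\{ -\frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^{2} \right\}$$

$$= \exp\left\{ -\frac{1}{2} \left[ \frac{1}{\sigma_0^{2}} \left( \mu^{2} - 2\mu\mu_0 + \mu_0^{2} \right) + \frac{1}{\sigma^{2}} \left( \sum_{i=1}^{n} y_i^{2} - 2\mu n \bar{y} + n\mu^{2} \right) \right] \right\}, \quad (1.1)$$

where $\bar{y} = (y_1 + \ldots + y_n)/n$ is the sample mean. 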


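We see that the posterior density of $\mu$ is proportional to the exponent of 
a quadratic in $\mu$. That is, 

$$f(\mu \mid y) \propto \exp\left\{ -\frac{1}{2\sigma_*^{2}} (\mu - \mu_*)^{2} \right\}, \quad (1.2)$$

which then implies that 
$$(\mu \mid y) \sim N(\mu_*, \sigma_*^{2}),$$
for some constants $\mu_*$ and $\sigma_*^{2}$. 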


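It remains to find the normal mean and variance parameters, $\mu_*$ and $\sigma_*^{2}$. 
(These must be functions of the known quantities $n$, $\bar{y}$, $\sigma$, $\mu_0$ and $\sigma_0$.) 
One way to obtain these parameters, which completely define $\mu$'s 
posterior distribution, is to complete the square in the exponent of (1.1). 
To this end we write 

$$f(\mu \mid y) \propto \exp\left( -\frac{q}{2} \right),$$

where 

$$q = \frac{1}{\sigma_0^{2}} \left( \mu^{2} - 2\mu\mu_0 \right) + \frac{1}{\sigma^{2}} \left( -2\mu n \bar{y} + n\mu^{2} \right)$$ 
(ignoring constants with respect to $\mu$) 

$$= \mu^{2} \left( \frac{1}{\sigma_0^{2}} + \frac{n}{\sigma^{2}} \right) - 2\mu \left( \frac{\mu_0}{\sigma_0^{2}} + \frac{n\bar{y}}{\sigma^{2}} \right) + c$$ 
(where $c$ is a constant with respect to $\mu$) 

$$= a\mu^{2} - 2b\mu + c \quad \text{where } a = \frac{1}{\sigma_0^{2}} + \frac{n}{\sigma^{2}} \text{ and } b = \frac{\mu_0}{\sigma_0^{2}} + \frac{n\bar{y}}{\sigma^{2}}$$

$$= a \left( \mu^{2} - 2\frac{b}{a}\mu + \frac{b^{2}}{a^{2}} \right) + c'$$ 
(where $c'$ is a constant with respect to $\mu$) 

$$= a \left( \mu - \frac{b}{a} \right)^{2} + c'.$$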
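
Thus 

$$f(\mu \mid y) \propto \exp\left\{ -\frac{a}{2} \left( \mu - \frac{b}{a} \right)^{2} \right\}. \quad (1.3)$$


So, equating (1.2) and (1.3), we obtain: 

$$\sigma_*^{2} = \frac{1}{a} = \frac{1}{\dfrac{1}{\sigma_0^{2}} + \dfrac{n}{\sigma^{2}}}, \qquad \mu_* = \frac{b}{a} = \frac{\dfrac{\mu_0}{\sigma_0^{2}} + \dfrac{n\bar{y}}{\sigma^{2}}}{\dfrac{1}{\sigma_0^{2}} + \dfrac{n}{\sigma^{2}}}. \quad (1.4)$$
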

Note 1: A little algebra (left as an additional exercise) shows that the 
posterior mean can also be written as 

$$\mu_* = (1-k)\mu_0 + k\bar{y},$$

and the posterior variance can be written as 

$$\sigma_*^{2} = k \frac{\sigma^{2}}{n},$$

where 

$$k = \frac{n}{n + \sigma^{2}/\sigma_0^{2}}.$$


We see that $\mu$'s posterior mean is a credibility estimate in the form of a 
weighted average of the prior mean $\mu_0$ and the sample mean $\bar{y}$ (which 
is also the maximum likelihood estimate), with the weight assigned to $\bar{y}$ 
being the credibility factor, $k$. More will be said on this further down. 
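
The ‘little algebra’ can be sanity-checked numerically. The R function below, `normal_posterior`, is a hypothetical helper written for this illustration, and the data vector and parameter values are made up. It computes the posterior mean and variance both from the precision form (1/a and b/a) and from the credibility form, so the two can be compared.

```r
# Posterior of mu in the normal-normal model with known sigma
normal_posterior <- function(y, sigma, mu0, sigma0) {
  n <- length(y); ybar <- mean(y)
  a <- 1 / sigma0^2 + n / sigma^2           # posterior precision
  b <- mu0 / sigma0^2 + n * ybar / sigma^2
  k <- n / (n + sigma^2 / sigma0^2)         # credibility factor
  list(mean = b / a, var = 1 / a, k = k,
       mean_cred = (1 - k) * mu0 + k * ybar,
       var_cred = k * sigma^2 / n)
}

y <- c(8.4, 10.1, 9.4)                      # hypothetical data (not from the text)
res <- normal_posterior(y, sigma = 1, mu0 = 5, sigma0 = 2)
unlist(res)
```

With these values the two expressions for the mean agree, as do the two expressions for the variance, and the credibility factor is k = 3/3.25.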


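Note 2: Another way to derive $\mu_*$ and $\sigma_*^{2}$ is to write (1.2) as 

$$f(\mu \mid y) \propto \exp\left\{ -\frac{1}{2\sigma_*^{2}} \left( \mu^{2} - 2\mu\mu_* + \mu_*^{2} \right) \right\} \quad (1.5)$$

and then equate coefficients of powers of $\mu$ in (1.1) and (1.5). This logic 
leads to 

$$\frac{1}{\sigma_*^{2}} = \frac{1}{\sigma_0^{2}} + \frac{n}{\sigma^{2}} \quad \text{and} \quad \frac{\mu_*}{\sigma_*^{2}} = \frac{\mu_0}{\sigma_0^{2}} + \frac{n\bar{y}}{\sigma^{2}},$$

and ultimately the same formulae for $\mu_*$ and $\sigma_*^{2}$ as given by (1.4). 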
Note 3: Since both prior and posterior are normal, the prior is conjugate.

Note 4: The posterior mean, mode and median of $\mu$ are the same and equal to $\mu_*$. The $1-\alpha$ CPDR and $1-\alpha$ HPDR for $\mu$ are the same and equal to $(\mu_* \pm z_{\alpha/2}\sigma_*)$.

Note 5: The posterior distribution of $\mu$ depends on the data vector $y = (y_1, \ldots, y_n)$ only by way of the sample mean, i.e. $\bar{y} = (y_1 + \ldots + y_n)/n$. Therefore, the main result, $(\mu \mid y) \sim N(\mu_*, \sigma_*^2)$, also implies that $(\mu \mid \bar{y}) \sim N(\mu_*, \sigma_*^2)$.

That is, if we know only the sample mean $\bar{y}$, the posterior distribution of $\mu$ is the same as if we know $y$, i.e. all $n$ sample values. Knowing the individual $y_i$ values makes no difference to the inference.


Note 6: The formula for the credibility factor in Note 1, namely

$$k = \frac{n}{n + \sigma^2/\sigma_0^2} = \frac{1}{1 + (\sigma^2/n)/\sigma_0^2},$$

makes sense in the following ways:

(i) If the prior standard deviation $\sigma_0$ is small then $k \approx 0$, so that $\mu_* \approx \mu_0$ and $\sigma_*^2 \approx \sigma_0^2$. Therefore $(\mu \mid y) \mathrel{\dot\sim} N(\mu_0, \sigma_0^2)$.

That is, if the prior information is very 'precise' or 'definite', the data has little influence on the posterior. So the posterior is approximately equal to the prior; i.e. $f(\mu \mid y) \approx f(\mu)$, or equivalently, $(\mu \mid y) \mathrel{\dot\sim} \mu$. In this case the posterior mean, mode and median of $\mu$ are approximately equal to $\mu_0$. Also, the $1-\alpha$ CPDR and $1-\alpha$ HPDR for $\mu$ are approximately equal to $(\mu_0 \pm z_{\alpha/2}\sigma_0)$.

(ii) If $\sigma_0$ is large then $k \approx 1$, so that $\mu_* \approx \bar{y}$, $\sigma_*^2 \approx \sigma^2/n$, and so $(\mu \mid y) \mathrel{\dot\sim} N(\bar{y}, \sigma^2/n)$.

That is, a large $\sigma_0$ corresponds to a highly disperse prior, reflecting little prior information and so little influence of the prior distribution (as specified by $\mu_0$ and $\sigma_0$) on the inference. In this case the posterior mean, mode and median of $\mu$ are approximately equal to $\bar{y}$. Also, the $1-\alpha$ CPDR and $1-\alpha$ HPDR for $\mu$ are approximately equal to $(\bar{y} \pm z_{\alpha/2}\sigma/\sqrt{n})$. Thus, inference is almost the same as implied by the classical approach.

(iii) If the sample size $n$ is large then $k \approx 1$, so that $\mu_* \approx \bar{y}$ and $\sigma_*^2 \approx \sigma^2/n$. Therefore $(\mu \mid y) \mathrel{\dot\sim} N(\bar{y}, \sigma^2/n)$.

So, in this case, just as when $\sigma_0$ is large, the prior distribution has very little influence on the posterior, and the ensuing inference is almost the same as that implied by the classical approach.
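The limiting behaviour described in (i) to (iii) can be seen numerically. The R sketch below evaluates the credibility factor over a few values of the prior standard deviation and of the sample size; the function name kfun and all input values are assumptions made for illustration only.

```r
# Credibility factor of the normal-normal model (kfun is this sketch's own name)
kfun = function(n, sig, sig0) n/(n + sig^2/sig0^2)

# (i) and (ii): k moves from near 0 to near 1 as the prior sd sig0 grows
kfun(n=3, sig=1, sig0=c(0.01, 0.5, 2, 100))

# (iii): k approaches 1 as the sample size n grows, with sig0 held fixed
kfun(n=c(1, 10, 100, 10000), sig=1, sig0=0.5)
```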


Note 7: In the case of a priori ignorance (meaning no prior information at all) it is customary to take $\sigma_0 = \infty$, which implies that
$$\mu \sim N(\mu_0, \infty).$$

This prior on $\mu$ appears to be problematic, because it is improper. However, it meaningfully leads to a proper posterior, namely
$$(\mu \mid y) \sim N(\bar{y}, \sigma^2/n),$$
which then leads to the same point and interval estimates implied by the classical approach, namely the MLE $\bar{y}$ and the $1-\alpha$ CI $(\bar{y} \pm z_{\alpha/2}\sigma/\sqrt{n})$.

The improper prior $\mu \sim N(\mu_0, \infty)$ may be described as 'flat' or 'uniform over the whole real line' and can also be written as
$$\mu \sim U(-\infty, \infty)$$
or $f(\mu) \propto 1, \ \mu \in \Re$.

In some cases (more complicated models not considered here), using an improper prior may lead to an improper posterior, which then becomes problematic. For more information on this topic, see Hobert and Casella (1996).
Summary: For the normal-normal model, defined by:
$$(y_1, \ldots, y_n \mid \mu) \sim \text{iid } N(\mu, \sigma^2)$$
$$\mu \sim N(\mu_0, \sigma_0^2),$$
the posterior distribution of the normal mean $\mu$ is given by
$$(\mu \mid y) \sim N(\mu_*, \sigma_*^2),$$
where:
$$\mu_* = (1-k)\mu_0 + k\bar{y}$$
$$\sigma_*^2 = k\frac{\sigma^2}{n}$$
$$k = \frac{n}{n + \sigma^2/\sigma_0^2} \quad \text{(the normal-normal model credibility factor)}.$$

The posterior mean, mode and median of $\mu$ are all equal to $\mu_*$, and the $1-\alpha$ CPDR and HPDR for $\mu$ are both $(\mu_* \pm z_{\alpha/2}\sigma_*)$.

In the case of a priori ignorance it is appropriate to set $\sigma_0 = \infty$.

This defines an improper prior
$$f(\mu) \propto 1, \ \mu \in \Re$$
and the proper posterior
$$(\mu \mid y) \sim N(\bar{y}, \sigma^2/n).$$


Exercise 1.19 Practice with the normal-normal model

In the context of the normal-normal model, given by:
$$(y_1, \ldots, y_n \mid \mu) \sim \text{iid } N(\mu, \sigma^2)$$
$$\mu \sim N(\mu_0, \sigma_0^2),$$
suppose that $y = (8.4, 10.1, 9.4)$, $\sigma = 1$, $\mu_0 = 5$ and $\sigma_0 = 1/2$.

Calculate the posterior mean, mode and median of $\mu$.
Also calculate the 95% CPDR and 95% HPDR for $\mu$.

Create a graph which shows these estimates as well as the prior density, prior mean, likelihood, MLE and posterior density.
Solution to Exercise 1.19

Here:
$$n = 3$$
$$\bar{y} = (8.4 + 10.1 + 9.4)/3 = 9.3$$
$$k = \frac{3}{3 + 1^2/(1/2)^2} = \frac{3}{7} = 0.4285714$$
$$\mu_* = \left(1 - \frac{3}{7}\right)5 + \frac{3}{7}(9.3) = 6.8428571$$
$$\sigma_*^2 = \frac{3}{7} \times \frac{1^2}{3} = \frac{1}{7} = 0.1428571.$$

So the posterior mean/mode/median is
$$\mu_* = 6.84286,$$
and the 95% CPDR/HPDR is
$$(\mu_* \pm z_{0.025}\sigma_*) = (6.84286 \pm 1.96\sqrt{0.14286}) = (6.102, 7.584).$$

Figure 1.10 shows the various densities and estimates here, as well as the normalised likelihood. Note that the likelihood function as shown is also the posterior density if the prior is taken to be uniform over the whole real line, i.e. $\mu \sim U(-\infty, \infty)$.


Discussion 


If we change $\sigma_0$ from 0.5 to 2 we get $k = 0.923$ and results as illustrated in Figure 1.11.

If we change $\sigma_0$ from 0.5 to 0.25 we get $k = 0.158$ and results as illustrated in Figure 1.12.

If we keep $\sigma_0$ as 0.5 but change $\sigma$ from 1 to 2 we get $k = 0.158$ and results as illustrated in Figure 1.13.

Note that the posteriors in Figures 1.12 and 1.13 have the same mean but different variances.
Figure 1.10 Results if $\sigma_0 = 0.5$, $\sigma = 1$, $k = n/(n + \sigma^2/\sigma_0^2) = 0.429$

Figure 1.11 Results if $\sigma_0 = 2$, $\sigma = 1$, $k = n/(n + \sigma^2/\sigma_0^2) = 0.923$

Figure 1.12 Results if $\sigma_0 = 0.25$, $\sigma = 1$, $k = n/(n + \sigma^2/\sigma_0^2) = 0.158$

Figure 1.13 Results if $\sigma_0 = 0.5$, $\sigma = 2$, $k = n/(n + \sigma^2/\sigma_0^2) = 0.158$

[Each figure plots the prior density, the normalised likelihood function and the posterior density, and marks the prior mean, the sample mean (MLE), the posterior mean and the 95% CPDR bounds.]

R Code for Exercise 1.19 
X11(w=8,h=5); par(mfrow=c(1,1)); mu0=5; sig0=0.5; sig=1

# Data, credibility factor k, posterior mean (mus) and posterior variance (sigs2)
y = c(8.4, 10.1, 9.4); n = length(y); k=1/(1+(sig^2/n)/sig0^2); k # 0.4285714
ybar=mean(y); ybar # 9.3
mus = (1-k)*mu0 + k*ybar; sigs2=k*sig^2/n
c(mus,sigs2) # 6.8428571 0.1428571

# Prior, posterior and normalised likelihood over a grid of mu values
muv=seq(0,15,0.01)
prior = dnorm(muv,mu0,sig0); post=dnorm(muv,mus,sqrt(sigs2))
like = dnorm(muv,ybar,sig/sqrt(n))
cpdr=mus+c(-1,1)*qnorm(0.975)*sqrt(sigs2)
cpdr # 6.102060 7.583654

plot(c(0,11),c(-0.1,1.3),type="n",xlab="",ylab="density/likelihood")
lines(muv,prior,lty=1,lwd=2); lines(muv,like,lty=2,lwd=2)
lines(muv,post,lty=3,lwd=2)
points(c(mu0,ybar,mus),c(0,0,0),pch=c(1,2,4),cex=rep(1.5,3),lwd=2)
points(cpdr,c(0,0),pch=rep(16,2),cex=rep(1.5,2))
legend(0,1.3,
   c("Prior density","Likelihood function (normalised)","Posterior density"),
   lty=c(1,2,3),lwd=c(2,2,2))
legend(0,0.7,c("Prior mean","Sample mean (MLE)","Posterior mean",
   "95% CPDR bounds"), pch=c(1,2,4,16),pt.cex=rep(1.5,4),pt.lwd=rep(2,4))
text(10.8,-0.075,"m", vfont=c("serif symbol","italic"), cex=1.5)

# Repeat above with sig0=2 to obtain Figure 1.11
# Repeat above with sig0=0.25 to obtain Figure 1.12
# Repeat above with sig0=0.5 and sig=2 to obtain Figure 1.13


Exercise 1.20 The normal-gamma model

Consider the following Bayesian model:
$$(y_1, \ldots, y_n \mid \lambda) \sim \text{iid } N(\mu, 1/\lambda)$$
$$\lambda \sim G(\alpha, \beta).$$

Find the posterior distribution of $\lambda$ given $y = (y_1, \ldots, y_n)$.
Note 1: In the normal-normal model, the normal mean $\mu$ is unknown and the normal variance $\sigma^2$ is known. Now we consider the same Bayesian model but with those roles reversed, i.e. with $\mu$ known and $\sigma^2$ unknown. For an example of where this kind of situation might arise, see Byrne and Dracoulis (1985).

Note 2: For reasons of mathematical convenience and conjugacy, we parameterise the normal distribution here via the precision parameter
$$\lambda = 1/\sigma^2,$$
rather than using $\sigma^2$ directly as before in the normal-normal model.

Note 3: An equivalent formulation of the normal-gamma model being considered here is:
$$(y_1, \ldots, y_n \mid \sigma^2) \sim \text{iid } N(\mu, \sigma^2)$$
$$\sigma^2 \sim IG(\alpha, \beta),$$
where this may be called the normal-inverse-gamma model.


Solution to Exercise 1.20 


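The posterior density of $\lambda$ is

$$f(\lambda \mid y) \propto f(\lambda) f(y \mid \lambda)$$

$$\propto \lambda^{\alpha-1} e^{-\beta\lambda} \times \prod_{i=1}^{n} \frac{1}{1/\sqrt{\lambda}} \exp\left\{-\frac{1}{2}\left(\frac{y_i - \mu}{1/\sqrt{\lambda}}\right)^2\right\}$$

$$\propto \lambda^{\alpha-1} e^{-\beta\lambda} \times \lambda^{n/2} \exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(y_i - \mu)^2\right\}$$

$$= \lambda^{a-1} e^{-b\lambda} \quad \text{for some } a \text{ and } b.$$

We see that
$$(\lambda \mid y) \sim G(a, b),$$

where:
$$a = \alpha + \frac{n}{2}, \quad b = \beta + \frac{n}{2}s_\mu^2, \quad s_\mu^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mu)^2.$$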
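The conjugate result can be checked numerically by multiplying the prior density by the likelihood on a grid and normalising. The R sketch below does this under assumed illustrative values of the data and of mu, alp and bet; the grid limits and step are also choices made for this sketch.

```r
# Illustrative inputs (assumed for this sketch only)
y = c(8.4, 10.1, 9.4); n = length(y); mu = 8; alp = 3; bet = 2
a = alp + n/2; b = bet + (n/2)*mean((y-mu)^2)

lamv = seq(0.001, 20, 0.001)
# Unnormalised posterior: prior density times likelihood at each grid value of lambda
unnorm = dgamma(lamv, alp, bet) *
   sapply(lamv, function(l) prod(dnorm(y, mu, 1/sqrt(l))))
post = unnorm/(sum(unnorm)*0.001)     # normalise numerically (Riemann sum)

max(abs(post - dgamma(lamv, a, b)))   # should be close to 0
```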
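Note 1: The posterior mean of $\lambda$, namely

$$E(\lambda \mid y) = \frac{a}{b} = \frac{\alpha + n/2}{\beta + (n/2)s_\mu^2} = \frac{2\alpha + n}{2\beta + n s_\mu^2},$$

converges to $\hat{\lambda} = 1/s_\mu^2$ (the MLE of $\lambda$) as $n \to \infty$.

If $\alpha = \beta = 0$ then $E(\lambda \mid y) = \hat{\lambda}$ exactly for all $n$.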


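Note 2: Unlike the posterior mean of $\mu$ in the normal-normal model, the posterior mean of $\lambda$ cannot be expressed as a credibility estimate of the form

$$(1-c)\,E\lambda + c\,\hat{\lambda}$$

with a credibility factor $c$ that does not depend on the data, where:

$E\lambda = \alpha/\beta$ (the prior mean of $\lambda$)

$\hat{\lambda} = 1/s_\mu^2$ (the MLE of $\lambda$).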


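Note 3: We may write the posterior as
$$(\lambda \mid y) \sim G\left(\frac{2\alpha + n}{2}, \frac{2\beta + n s_\mu^2}{2}\right).$$

It can then be shown via the method of transformations that
$$(u \mid y) \sim G\left(\frac{2\alpha + n}{2}, \frac{1}{2}\right) \sim \chi^2(2\alpha + n),$$
where $u = (2\beta + n s_\mu^2)\lambda$.

So the $1-A$ CPDR for $u$ is $\left(\chi^2_{1-A/2}(2\alpha + n), \ \chi^2_{A/2}(2\alpha + n)\right)$.

So the $1-A$ CPDR for $\lambda = \dfrac{u}{2\beta + n s_\mu^2}$ is
$$\left(\frac{\chi^2_{1-A/2}(2\alpha + n)}{2\beta + n s_\mu^2}, \ \frac{\chi^2_{A/2}(2\alpha + n)}{2\beta + n s_\mu^2}\right).$$

So the $1-A$ CPDR for $\sigma^2 = 1/\lambda$ is
$$\left(\frac{2\beta + n s_\mu^2}{\chi^2_{A/2}(2\alpha + n)}, \ \frac{2\beta + n s_\mu^2}{\chi^2_{1-A/2}(2\alpha + n)}\right).$$

If $\alpha = \beta = 0$, this is exactly the same as the classical $1-A$ CI for $\sigma^2$.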
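The equivalence between the gamma-quantile and chi-squared forms of the CPDR can be confirmed numerically. The R sketch below assumes illustrative values for the data, mu, alp, bet and the level A. Note that qchisq() returns lower-tail quantiles, whereas the chi-squared subscripts in Note 3 denote upper-tail probabilities.

```r
# Illustrative inputs (assumed for this sketch only); A is the non-coverage level
y = c(8.4, 10.1, 9.4); n = length(y); mu = 8; alp = 3; bet = 2; A = 0.05
smu2 = mean((y-mu)^2); a = alp + n/2; b = bet + (n/2)*smu2

# CPDR for lambda directly from the G(a,b) posterior
cpdr1 = qgamma(c(A/2, 1-A/2), a, b)

# Same CPDR via the chi-squared form; the upper-tail quantile with subscript 1-A/2
# is the lower-tail quantile qchisq(A/2, df), and similarly for the other bound
cpdr2 = qchisq(c(A/2, 1-A/2), 2*alp+n)/(2*bet + n*smu2)

all.equal(cpdr1, cpdr2)   # the two forms should agree
1/rev(cpdr1)              # corresponding CPDR for sigma^2 = 1/lambda
```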
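Note 4: The classical $1-A$ CI for $\sigma^2$ may be derived as follows. First consider all parameters fixed as constants. Then

$$\frac{y_1 - \mu}{\sigma}, \ldots, \frac{y_n - \mu}{\sigma} \sim \text{iid } N(0, 1).$$

So
$$\left(\frac{y_1 - \mu}{\sigma}\right)^2, \ldots, \left(\frac{y_n - \mu}{\sigma}\right)^2 \sim \text{iid } \chi^2(1).$$

So
$$\sum_{i=1}^{n}\left(\frac{y_i - \mu}{\sigma}\right)^2 = \frac{n s_\mu^2}{\sigma^2} \sim \chi^2(n).$$

So
$$1 - A = P\left(\chi^2_{1-A/2}(n) < \frac{n s_\mu^2}{\sigma^2} < \chi^2_{A/2}(n)\right) = P\left(\frac{n s_\mu^2}{\chi^2_{A/2}(n)} < \sigma^2 < \frac{n s_\mu^2}{\chi^2_{1-A/2}(n)}\right).$$

Note 5: Notes 1 to 3 indicate that in the case of a priori ignorance, a reasonable specification is
$$\alpha = \beta = 0,$$
or equivalently,
$$f(\lambda) \propto 1/\lambda, \ \lambda > 0.$$

This improper prior may be thought of as the limiting case as $\varepsilon \to 0$ of the proper prior
$$\lambda \sim \text{Gam}(\varepsilon, \varepsilon),$$
where $\varepsilon > 0$.

Observe that
$$E\lambda = \varepsilon/\varepsilon = 1$$
for all $\varepsilon$, and
$$V\lambda = \varepsilon/\varepsilon^2 = 1/\varepsilon \to \infty$$
as $\varepsilon \to 0$.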
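Summary: For the normal-gamma model, defined by:
$$(y_1, \ldots, y_n \mid \lambda) \sim \text{iid } N(\mu, 1/\lambda)$$
$$\lambda \sim G(\alpha, \beta),$$
the posterior distribution of $\lambda$ is given by
$$(\lambda \mid y) \sim G(a, b),$$
where:
$$a = \alpha + \frac{n}{2}, \quad b = \beta + \frac{n}{2}s_\mu^2, \quad s_\mu^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mu)^2.$$

The posterior mean of $\lambda$ is $a/b$. The posterior median is $F^{-1}_{G(a,b)}(1/2)$.
The posterior mode of $\lambda$ is $(a-1)/b$ if $a > 1$; otherwise that mode is 0.

The $1-A$ CPDR for $\lambda$ is $\left(F^{-1}_{G(a,b)}(A/2), \ F^{-1}_{G(a,b)}(1 - A/2)\right)$ and may also be written as
$$\left(\frac{\chi^2_{1-A/2}(2\alpha + n)}{2\beta + n s_\mu^2}, \ \frac{\chi^2_{A/2}(2\alpha + n)}{2\beta + n s_\mu^2}\right).$$

The $1-A$ CPDR for $\sigma^2 = 1/\lambda$ is
$$\left(\frac{2\beta + n s_\mu^2}{\chi^2_{A/2}(2\alpha + n)}, \ \frac{2\beta + n s_\mu^2}{\chi^2_{1-A/2}(2\alpha + n)}\right).$$

In the case of a priori ignorance it is appropriate to set $\alpha = \beta = 0$.
This defines an improper prior with density
$$f(\lambda) \propto 1/\lambda, \ \lambda > 0,$$
and a proper posterior distribution given by
$$(n s_\mu^2 \lambda \mid y) \sim \chi^2(n).$$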


Exercise 1.21 Practice with the normal-gamma model

In the context of the normal-gamma model, given by:
$$(y_1, \ldots, y_n \mid \lambda) \sim \text{iid } N(\mu, 1/\lambda)$$
$$\lambda \sim \text{Gamma}(\alpha, \beta),$$
suppose that $y = (8.4, 10.1, 9.4)$, $\mu = 8$, $\alpha = 3$ and $\beta = 2$.

(a) Calculate the posterior mean, mode and median of the model precision $\lambda$. Also calculate the 95% CPDR for $\lambda$. Create a graph which shows these estimates as well as the prior density, prior mean, likelihood, MLE and posterior density.

(b) Calculate the posterior mean, mode and median of the model variance $\sigma^2 = 1/\lambda$. Also calculate the 95% CPDR for $\sigma^2$. Create a graph which shows these estimates as well as the prior density, prior mean, likelihood, MLE and posterior density.

(c) Calculate the posterior mean, mode and median of the model standard deviation $\sigma$. Also calculate the 95% CPDR for $\sigma$. Create a graph which shows these estimates as well as the prior density, prior mean, likelihood, MLE and posterior density.

(d) Examine each of the point estimates in (a), (b) and (c) and determine which ones, if any, can be easily expressed in the form of a credibility estimate.


Solution to Exercise 1.21 


(a) The required posterior distribution is $(\lambda \mid y) \sim \text{Gamma}(a, b)$, where:
$$a = \alpha + \frac{n}{2} = 4.5, \quad b = \beta + \frac{n}{2}s_\mu^2 = 5.265, \quad s_\mu^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mu)^2 = 2.177.$$

So:

* the posterior mean of $\lambda$ is $E(\lambda \mid y) = a/b = 0.8547$
* the posterior mode is $\text{Mode}(\lambda \mid y) = (a-1)/b = 0.6648$
* the posterior median is the 0.5 quantile of the $G(a,b)$ distribution and works out as $\text{Median}(\lambda \mid y) = 0.7923$ (as obtained using the qgamma() function in R; see below)
* the 95% CPDR for $\lambda$ is (0.2564, 1.8065) (where the bounds are the 0.025 and 0.975 quantiles of the $G(a,b)$ distribution).

Also:

* the prior mean is $E\lambda = \alpha/\beta = 1.5$
* the prior mode is $\text{Mode}(\lambda) = (\alpha - 1)/\beta = 1$
* the prior median is $\text{Median}(\lambda) = 1.3370$
* the MLE of $\lambda$ is $\hat{\lambda} = 1/s_\mu^2 = 0.4594$ (note that this estimate is biased).

Figure 1.14 shows the various densities and estimates here, as well as the normalised likelihood function.

Note: The normalised likelihood function (with area below equal to 1) is the same as the posterior density of $\lambda$ if the prior is taken to be uniform over the positive real line, i.e. $\lambda \sim U(0, \infty)$. This prior is specified by taking $\alpha = 1$ and $\beta = 0$, because then $f(\lambda) \propto \lambda^{1-1}e^{-0\lambda} = 1$.


Figure 1.14 Results for Exercise 1.21(a)

[Plot titled 'Inference on the model precision parameter', with lambda on the horizontal axis and density/likelihood on the vertical axis. It shows the prior density, normalised likelihood function and posterior density, and marks the prior mode, median and mean (left to right), the MLE, the posterior mode, median and mean (left to right) and the 95% CPDR bounds.]


(b) As regards the model variance $\sigma^2 = 1/\lambda$, we note that $\sigma^2 \sim IG(\alpha, \beta)$ with density

$$f(\sigma^2) = f(\lambda)\left|\frac{d\lambda}{d\sigma^2}\right| \quad \text{where } \lambda = (\sigma^2)^{-1}$$

$$= \frac{\beta^\alpha (1/\sigma^2)^{\alpha-1} e^{-\beta/\sigma^2}}{\Gamma(\alpha)}\left|-(\sigma^2)^{-2}\right| = \frac{\beta^\alpha (\sigma^2)^{-(\alpha+1)} e^{-\beta/\sigma^2}}{\Gamma(\alpha)}, \ \sigma^2 > 0. \qquad (1.6)$$

Then, by well-known properties of the inverse gamma distribution and maximum likelihood theory:

* the prior mean of $\sigma^2$ is $E\sigma^2 = \beta/(\alpha - 1) = 1$
* the prior mode is $\text{Mode}(\sigma^2) = \beta/(\alpha + 1) = 0.5$
* the prior median is $\text{Median}(\sigma^2) = 1/\text{Median}(\lambda) = 0.7479$
* the MLE of $\sigma^2$ is $\hat{\sigma}^2 = 1/\hat{\lambda} = s_\mu^2 = 2.1767$ (note that this estimate is unbiased).

By analogy with the prior (1.6), we find that $(\sigma^2 \mid y) \sim IG(a, b)$ with density

$$f(\sigma^2 \mid y) = \frac{b^a (\sigma^2)^{-(a+1)} e^{-b/\sigma^2}}{\Gamma(a)}, \ \sigma^2 > 0,$$

and hence that:

* the posterior mean of $\sigma^2$ is $E(\sigma^2 \mid y) = b/(a-1) = 1.5043$
* the posterior mode is $\text{Mode}(\sigma^2 \mid y) = b/(a+1) = 0.9573$
* the posterior median is $\text{Median}(\sigma^2 \mid y) = 1/\text{Median}(\lambda \mid y) = 1.2622$ (since $1/2 = P(\sigma^2 < m \mid y) = P(1/\lambda < m \mid y) = P(1/m < \lambda \mid y)$)
* the 95% CPDR for $\sigma^2$ is (0.5535, 3.8994) (where the lower and upper bounds are the inverses of the 0.975 and 0.025 quantiles of the $G(a,b)$ distribution, respectively).

Figure 1.15 shows the various densities and estimates here, as well as the normalised likelihood function.

Note: The normalised likelihood function is the same as the posterior density of $\sigma^2$ if the prior on $\sigma^2$ is taken to be uniform over the positive real line, i.e. $f(\sigma^2) \propto 1, \ \sigma^2 > 0$. This prior is specified by $\lambda \sim G(-1, 0)$, i.e. by $\alpha = -1$ and $\beta = 0$, as is evident from (1.6) above.


Figure 1.15 Results for Exercise 1.21(b)

[Plot titled 'Inference on the model variance parameter', with sigma^2 = 1/lambda on the horizontal axis and density/likelihood on the vertical axis. It shows the prior density, normalised likelihood function and posterior density, and marks the prior mode, median and mean (left to right), the MLE, the posterior mode, median and mean (left to right) and the 95% CPDR bounds.]
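(c) As regards the model standard deviation $\sigma = 1/\sqrt{\lambda}$, observe that the prior density of this quantity is

$$f(\sigma) = f(\lambda)\left|\frac{d\lambda}{d\sigma}\right| \quad \text{where } \lambda = \sigma^{-2}$$

$$= \frac{\beta^\alpha (\sigma^{-2})^{\alpha-1} e^{-\beta/\sigma^2}}{\Gamma(\alpha)}\left|-2\sigma^{-3}\right| = \frac{2\beta^\alpha \sigma^{-(2\alpha+1)} e^{-\beta/\sigma^2}}{\Gamma(\alpha)}, \ \sigma > 0. \qquad (1.7)$$

We find that:

* the prior mean of $\sigma$ is
$$E\sigma = E\lambda^{-1/2} = \int_0^\infty \lambda^{-1/2}\frac{\beta^\alpha \lambda^{\alpha-1} e^{-\beta\lambda}}{\Gamma(\alpha)}\,d\lambda = \frac{\beta^\alpha\,\Gamma(\alpha - 1/2)}{\beta^{\alpha-1/2}\,\Gamma(\alpha)}\int_0^\infty \frac{\beta^{\alpha-1/2}\lambda^{(\alpha-1/2)-1} e^{-\beta\lambda}}{\Gamma(\alpha - 1/2)}\,d\lambda$$
$$= \beta^{1/2}\frac{\Gamma(\alpha - 1/2)}{\Gamma(\alpha)} = 0.9400$$

* the prior mode of $\sigma$ is $\text{Mode}(\sigma) = \sqrt{\dfrac{2\beta}{2\alpha + 1}} = 0.7559$ (obtained by setting the derivative of the logarithm of (1.7) to zero, where that derivative is derived as follows:
$$l(\sigma) = \log f(\sigma) = -(2\alpha + 1)\log\sigma - \beta\sigma^{-2} + \text{constant}$$
$$\Rightarrow l'(\sigma) = -\frac{2\alpha + 1}{\sigma} + 2\beta\sigma^{-3} \overset{\text{set}}{=} 0 \Rightarrow \sigma^2 = \frac{2\beta}{2\alpha + 1})$$

* the prior median of $\sigma$ is $\text{Median}(\sigma) = \sqrt{\text{Median}(\sigma^2)} = 0.8648$
* the MLE of $\sigma$ is $\hat{\sigma} = \sqrt{s_\mu^2} = 1.4754$ (which is biased).

By analogy with the above,
$$f(\sigma \mid y) = \frac{2b^a \sigma^{-(2a+1)} e^{-b/\sigma^2}}{\Gamma(a)}, \ \sigma > 0.$$

So we find that:

* the posterior mean of $\sigma$ is $E(\sigma \mid y) = b^{1/2}\dfrac{\Gamma(a - 1/2)}{\Gamma(a)} = 1.1836$
* the posterior mode is $\text{Mode}(\sigma \mid y) = \sqrt{\dfrac{2b}{2a + 1}} = 1.0262$
* the posterior median is $\text{Median}(\sigma \mid y) = \sqrt{\text{Median}(\sigma^2 \mid y)} = 1.1235$ (since $1/2 = P(\sigma^2 < m \mid y) = P(\sigma < \sqrt{m} \mid y)$)
* the 95% CPDR for $\sigma$ is (0.7440, 1.9747) (where these bounds are the square roots of the bounds of the 95% CPDR for $\sigma^2$).

Figure 1.16 shows the various densities and estimates here, as well as the normalised likelihood function.

Note: The normalised likelihood function is the same as the posterior density of $\sigma$ if the prior on $\sigma$ is taken to be uniform over the positive real line, i.e. $f(\sigma) \propto 1, \ \sigma > 0$. This prior is specified by $\lambda \sim G(-1/2, 0)$, i.e. by $\alpha = -1/2$ and $\beta = 0$, as is evident from (1.7) above.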


Figure 1.16 Results for Exercise 1.21(c)

[Plot titled 'Inference on the model standard deviation parameter', with sigma = 1/sqrt(lambda) on the horizontal axis and density/likelihood on the vertical axis. It shows the prior density, normalised likelihood function and posterior density, and marks the prior mode, median and mean (left to right), the MLE, the posterior mode, median and mean (left to right) and the 95% CPDR bounds.]


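(d) Considering the various point estimates of $\lambda$, $\sigma^2$ and $\sigma$ derived above, we find that two of them can easily be expressed as credibility estimates, as follows. First, observe that

$$E(\sigma^2 \mid y) = \frac{b}{a - 1} = \frac{\beta + n s_\mu^2/2}{\alpha + (n/2) - 1} = \frac{2\beta + n s_\mu^2}{2\alpha + n - 2} = \frac{2\beta}{n + 2\alpha - 2} + \left(\frac{n}{n + 2\alpha - 2}\right)s_\mu^2,$$

where

$$\frac{2\beta}{n + 2\alpha - 2} = \left(\frac{2\alpha - 2}{n + 2\alpha - 2}\right)\frac{\beta}{\alpha - 1} = \left(1 - \frac{n}{n + 2\alpha - 2}\right)\frac{\beta}{\alpha - 1}.$$

We see that the posterior mean of $\sigma^2$ is a credibility estimate of the form
$$E(\sigma^2 \mid y) = (1 - c)E\sigma^2 + c\,s_\mu^2,$$
where:

$E\sigma^2 = \dfrac{\beta}{\alpha - 1}$ is the prior mean of $\sigma^2$

$s_\mu^2 = \dfrac{1}{n}\sum_{i=1}^{n}(y_i - \mu)^2$ is the MLE of $\sigma^2$

$c = \dfrac{n}{n + 2\alpha - 2}$ is the credibility factor (assigned to the MLE).

Likewise,

$$\text{Mode}(\sigma^2 \mid y) = \frac{b}{a + 1} = \frac{\beta + n s_\mu^2/2}{\alpha + (n/2) + 1} = \frac{2\beta + n s_\mu^2}{2\alpha + n + 2} = \frac{2\beta}{n + 2\alpha + 2} + \left(\frac{n}{n + 2\alpha + 2}\right)s_\mu^2,$$

where

$$\frac{2\beta}{n + 2\alpha + 2} = \left(\frac{2\alpha + 2}{n + 2\alpha + 2}\right)\frac{\beta}{\alpha + 1} = \left(\frac{2\alpha + 2}{n + 2\alpha + 2}\right)\text{Mode}(\sigma^2) = \left(1 - \frac{n}{n + 2\alpha + 2}\right)\text{Mode}(\sigma^2).$$

We see that the posterior mode of $\sigma^2$ is a credibility estimate of the form
$$\text{Mode}(\sigma^2 \mid y) = (1 - d)\,\text{Mode}(\sigma^2) + d\,s_\mu^2,$$
where:

$\text{Mode}(\sigma^2) = \dfrac{\beta}{\alpha + 1}$ is the prior mode of $\sigma^2$

$s_\mu^2 = \dfrac{1}{n}\sum_{i=1}^{n}(y_i - \mu)^2$ is the MLE of $\sigma^2$ (i.e. mode of the likelihood function)

$d = \dfrac{n}{n + 2\alpha + 2}$ is the credibility factor (assigned to the MLE).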
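The two credibility representations in (d) can be verified numerically with the data and prior of Exercise 1.21. The R sketch below is a minimal check; the names cc and d for the two credibility factors are this sketch's own (cc is used rather than c to avoid masking R's c() function).

```r
y = c(8.4, 10.1, 9.4); n = length(y); mu = 8; alp = 3; bet = 2
smu2 = mean((y-mu)^2); a = alp + n/2; b = bet + (n/2)*smu2

cc = n/(n + 2*alp - 2)   # credibility factor for the posterior mean of sigma^2
d  = n/(n + 2*alp + 2)   # credibility factor for the posterior mode of sigma^2

# Posterior mean and mode of sigma^2, each compared with its credibility form
all.equal(b/(a-1), (1-cc)*bet/(alp-1) + cc*smu2)
all.equal(b/(a+1), (1-d)*bet/(alp+1) + d*smu2)
```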
R Code for Exercise 1.21

# (a) Inference on lambda -----------------------------------------------

y = c(8.4, 10.1, 9.4); n = length(y); mu=8; alp=3; bet=2; options(digits=4)
a=alp+n/2; sigmu2=mean((y-mu)^2); b=bet+(n/2)*sigmu2
c(a,sigmu2,b) # 4.500 2.177 5.265

# Prior, likelihood-based and posterior point estimates, and the 95% CPDR
lampriormean=alp/bet; lamlikemode=1/sigmu2; lampriormode=(alp-1)/bet
lampriormedian=qgamma(0.5,alp,bet)
lampostmean=a/b; lampostmode=(a-1)/b; lampostmedian=qgamma(0.5,a,b)
lamcpdr=qgamma(c(0.025,0.975),a,b)

c(lampriormean,lamlikemode,lampriormode,lampriormedian,
   lampostmode,lampostmedian,lampostmean,lamcpdr)
# 1.5000 0.4594 1.0000 1.3370 0.6648 0.7923 0.8547 0.2564 1.8065

# The normalised likelihood is the posterior under alp=1, bet=0
lamv=seq(0,5,0.01); prior=dgamma(lamv,alp,bet)
post=dgamma(lamv,a,b); like=dgamma(lamv,a-alp+1,b-bet+0)

X11(w=8,h=4); par(mfrow=c(1,1))

plot(c(0,5),c(0,1.9),type="n",
   main="Inference on the model precision parameter",
   xlab="lambda",ylab="density/likelihood")
lines(lamv,prior,lty=1,lwd=2); lines(lamv,like,lty=2,lwd=2)
lines(lamv,post,lty=3,lwd=2)
points(c(lampriormean,lampriormode,lampriormedian,
   lamlikemode,lampostmode,lampostmedian,lampostmean),
   rep(0,7),pch=c(1,1,1,2,4,4,4),cex=rep(1.5,7),lwd=2)
points(lamcpdr,c(0,0),pch=rep(16,2),cex=rep(1.5,2))

legend(0,1.9,
   c("Prior density","Likelihood function (normalised)","Posterior density"),
   lty=c(1,2,3),lwd=c(2,2,2))
legend(3,1.9,c("Prior mode, median\n & mean (left to right)",
   "MLE"), pch=c(1,2),pt.cex=rep(1.5,2),pt.lwd=rep(2,2))
legend(3,1,c("Posterior mode, median\n & mean (left to right)",
   "95% CPDR bounds"), pch=c(4,16),pt.cex=rep(1.5,2),pt.lwd=rep(2,2))
# (b) Inference on sigma2 = 1/lambda -------------------------------------------------

sig2priormean=bet/(alp-1); sig2likemode=sigmu2; sig2priormode=bet/(alp+1)
sig2postmean=b/(a-1); sig2postmode=b/(a+1)
sig2postmedian=1/lampostmedian
sig2cpdr=1/qgamma(c(0.975,0.025),a,b); sig2priormedian=1/lampriormedian

c(sig2priormean,sig2likemode,sig2priormode,sig2priormedian,
   sig2postmode,sig2postmedian,sig2postmean,sig2cpdr)
# 1.0000 2.1767 0.5000 0.7479 0.9573 1.2622 1.5043 0.5535 3.8994

# Inverse gamma densities obtained from gamma densities via the Jacobian 1/sig2v^2;
# the normalised likelihood is the posterior under alp=-1, bet=0
sig2v=seq(0.01,10,0.01); prior=dgamma(1/sig2v,alp,bet)/sig2v^2
post=dgamma(1/sig2v,a,b)/sig2v^2
like=dgamma(1/sig2v,a-alp-1,b-bet+0)/sig2v^2

plot(c(0,10),c(0,1.2),type="n",
   main="Inference on the model variance parameter",
   xlab="sigma^2 = 1/lambda",ylab="density/likelihood")
lines(sig2v,prior,lty=1,lwd=2); lines(sig2v,like,lty=2,lwd=2)
lines(sig2v,post,lty=3,lwd=2)

points(c(sig2priormean,sig2priormode,sig2priormedian,sig2likemode,
   sig2postmode,sig2postmedian,sig2postmean),
   rep(0,7),pch=c(1,1,1,2,4,4,4),cex=rep(1.5,7),lwd=2)
points(sig2cpdr,c(0,0),pch=rep(16,2),cex=rep(1.5,2))

legend(1.8,1.2,
   c("Prior density","Likelihood function (normalised)","Posterior density"),
   lty=c(1,2,3),lwd=c(2,2,2))
legend(7,1.2,c("Prior mode, median\n & mean (left to right)",
   "MLE"), pch=c(1,2),pt.cex=rep(1.5,2),pt.lwd=rep(2,2))
legend(6,0.65,c("Posterior mode, median\n & mean (left to right)",
   "95% CPDR bounds"), pch=c(4,16),pt.cex=rep(1.5,2),pt.lwd=rep(2,2))

# abline(h=max(like),lty=3) # Checking likelihood and MLE are consistent
# fun=function(t){ dgamma(1/t,a-alp-1,b-bet+0)/t^2 }
# integrate(f=fun,lower=0,upper=Inf)$value
# 1 Checking likelihood is normalised
# (c) Inference on sigma = 1/sqrt(lambda) ---------------------------------------------

sigpriormean=sqrt(bet)*gamma(alp-1/2)/gamma(alp)
siglikemode=sqrt(sigmu2); sigpriormode=sqrt(2*bet/(2*alp+1))
sigpostmean=sqrt(b)*gamma(a-1/2)/gamma(a)
sigpostmode=sqrt(2*b/(2*a+1)); sigpostmedian=sqrt(sig2postmedian)
sigcpdr=sqrt(sig2cpdr); sigpriormedian=sqrt(sig2priormedian)

c(sigpriormean,siglikemode,sigpriormode,sigpriormedian,
   sigpostmode,sigpostmedian,sigpostmean,sigcpdr)
# 0.9400 1.4754 0.7559 0.8648 1.0262 1.1235 1.1836 0.7440 1.9747

# Densities of sigma obtained from gamma densities via the Jacobian 2/sigv^3;
# the normalised likelihood is the posterior under alp=-1/2, bet=0
sigv=seq(0.01,3,0.01); prior=dgamma(1/sigv^2,alp,bet)*2/sigv^3
post=dgamma(1/sigv^2,a,b)*2/sigv^3
like=dgamma(1/sigv^2,a-alp-1/2,b-bet+0)*2/sigv^3

plot(c(0,2.5),c(0,4.1),type="n",
   main="Inference on the model standard deviation parameter",
   xlab="sigma = 1/sqrt(lambda)",ylab="density/likelihood")
lines(sigv,prior,lty=1,lwd=2)
lines(sigv,like,lty=2,lwd=2)
lines(sigv,post,lty=3,lwd=2)
points(c(sigpriormean,sigpriormode,sigpriormedian,siglikemode,
   sigpostmode,sigpostmedian,sigpostmean),
   rep(0,7),pch=c(1,1,1,2,4,4,4),cex=rep(1.5,7),lwd=2)
points(sigcpdr,c(0,0),pch=rep(16,2),cex=rep(1.5,2))

legend(0,4.1,
   c("Prior density","Likelihood function (normalised)","Posterior density"),
   lty=c(1,2,3),lwd=c(2,2,2))
legend(1.7,4.1,c("Prior mode, median\n & mean (left to right)",
   "MLE"), pch=c(1,2),pt.cex=rep(1.5,2),pt.lwd=rep(2,2))
legend(1.7,2.3,c("Posterior mode, median\n & mean (left to right)",
   "95% CPDR bounds"), pch=c(4,16),pt.cex=rep(1.5,2),pt.lwd=rep(2,2))
2.1 Frequentist characteristics of Bayesian estimators

Consider a Bayesian model defined by a likelihood $f(y \mid \theta)$ and a prior $f(\theta)$, leading to the posterior
$$f(\theta \mid y) = \frac{f(\theta) f(y \mid \theta)}{f(y)}.$$

Suppose that we choose to perform inference on $\theta$ by constructing a point estimate $\hat{\theta}$ (such as the posterior mean, mode or median) and a $(1-\alpha)$-level interval estimate $I = (L, U)$ (such as the CPDR or HPDR).

Then $\hat{\theta}$, $I$, $L$ and $U$ are functions of the data $y$ and may be written $\hat{\theta}(y)$, $I(y)$, $L(y)$ and $U(y)$. Once these functions are defined, the estimates which they define stand on their own, so to speak, and may be studied from many different perspectives.

Naturally, the characteristics of these estimates may be seen in the context of the Bayesian framework in which they were constructed. More will be said on this below when we come to discuss Bayesian decision theory.

However, another important use of Bayesian estimates is as a proxy for classical estimates. We have already mentioned this in relation to the normal-normal model:
$$(y_1, \ldots, y_n \mid \mu) \sim \text{iid } N(\mu, \sigma^2)$$
$$\mu \sim N(\mu_0, \sigma_0^2),$$
where the use of a particular prior, namely the one specified by $\sigma_0 = \infty$, led to the point estimate $\hat{\mu} = \hat{\mu}(y) = \bar{y}$ and the interval estimate
$$I(y) = (L(y), U(y)) = (\bar{y} \pm z_{\alpha/2}\sigma/\sqrt{n}).$$

As we noted earlier, these estimates are exactly the same as the usual estimates used in the context of the corresponding classical model,
$$y_1, \ldots, y_n \sim \text{iid } N(\mu, \sigma^2),$$
where $\mu$ is an unknown constant and $\sigma^2$ is given.

Therefore, the frequentist operating characteristics of the Bayesian estimates are immediately known. In particular, we refer to the fact that the frequentist bias of $\hat{\mu}$ is zero, and the frequentist coverage probability of $I$ is exactly $1-\alpha$. These statements mean that the expected value of $\bar{y}$ given $\mu$ is $\mu$ for all possible values of $\mu$, and that the probability of $\mu$ being inside $I$ given $\mu$ is $1-\alpha$ for all possible values of $\mu$.


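More generally, in the context of a Bayesian model as above, we may define the frequentist bias of a Bayesian point estimate $\hat{\theta} = \hat{\theta}(y)$ as
$$B_\theta = E\{\hat{\theta}(y) - \theta \mid \theta\}.$$

Also, we may define the frequentist relative bias of $\hat{\theta}$ as
$$R_\theta = E\left\{\frac{\hat{\theta}(y) - \theta}{\theta} \,\Big|\, \theta\right\} = \frac{B_\theta}{\theta} \quad (\theta \neq 0).$$

Furthermore, we may define the frequentist coverage probability (FCP) of a Bayesian interval estimate $I(y) = (L(y), U(y))$ as
$$C_\theta = P\{\theta \in I(y) \mid \theta\}.$$

Thus, for the normal-normal model with $\sigma_0 = \infty$, we may write:
$$B_\mu = E\{\hat{\mu}(y) - \mu \mid \mu\} = E(\bar{y} \mid \mu) - \mu = \mu - \mu = 0 \quad \forall \mu \in \Re$$
$$R_\mu = \frac{0}{\mu} = 0 \quad (\mu \neq 0)$$
$$C_\mu = P\{\mu \in I(y) \mid \mu\} = P\left(\bar{y} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{y} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \,\Big|\, \mu\right) = 1 - \alpha \quad \forall \mu \in \Re.$$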
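These frequentist properties under the flat prior can also be checked by simulation. The following R sketch draws many sample means from their sampling distribution for a fixed true mean and estimates the bias and coverage by Monte Carlo. The true mean, sample size, number of replications J and the random seed are arbitrary assumptions made for this sketch.

```r
set.seed(1)                        # arbitrary seed, for reproducibility
mu = 2; sig = 1; n = 10; alp = 0.05; J = 20000   # illustrative values

ybar = rnorm(J, mu, sig/sqrt(n))   # J draws from the sampling distribution of ybar
z = qnorm(1-alp/2)
L = ybar - z*sig/sqrt(n); U = ybar + z*sig/sqrt(n)

mean(ybar) - mu                    # Monte Carlo estimate of the bias (near 0)
mean(L < mu & mu < U)              # Monte Carlo estimate of the FCP (near 0.95)
```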


The above analysis is straightforward enough. However, in the case of an informative prior (one with $\sigma_0 < \infty$), or in the context of other Bayesian models, the frequentist bias of a Bayesian point estimate ($B_\theta$) and the frequentist coverage probability of a Bayesian interval estimate ($C_\theta$) may not be so obvious. Working out these functions may be useful for adding insight to the estimation process as well as for deciding whether or not to use a set of Bayesian estimates as frequentist proxies.


Exercise 2.1 Frequentist characteristics of estimators in the normal-normal model

Consider the normal-normal model:
$$(y_1, \ldots, y_n \mid \mu) \sim \text{iid } N(\mu, \sigma^2)$$
$$\mu \sim N(\mu_0, \sigma_0^2).$$

Work out general formulae for the frequentist bias and relative bias of the posterior mean of $\mu$, and for the frequentist coverage probability of the $1-\alpha$ HPDR for $\mu$.

Produce graphs showing a number of examples of each of these three functions.

Solution to Exercise 2.1

Recall that
$$(\mu \mid y) \sim N(\mu_*, \sigma_*^2),$$
where:

$\mu_* = (1-k)\mu_0 + k\bar{y}$ is $\mu$'s posterior mean

$\sigma_*^2 = k\dfrac{\sigma^2}{n}$ is $\mu$'s posterior variance

$k = \dfrac{n}{n + \sigma^2/\sigma_0^2}$ is a credibility factor.

Also, recall that $\mu$'s HPDR (and CPDR) is
$$(\mu_* \pm z_{\alpha/2}\sigma_*).$$

Using these results, we find that the frequentist bias of the posterior mean of $\mu$ is
$$B_\mu = E(\mu_* - \mu \mid \mu) = (1-k)\mu_0 + kE(\bar{y} \mid \mu) - \mu$$
$$= (1-k)\mu_0 + k\mu - \mu$$
$$= (1-k)(\mu_0 - \mu).$$
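Also, the frequentist relative bias of that mean is
$$R_\mu = \frac{B_\mu}{\mu} = \frac{(1-k)(\mu_0 - \mu)}{\mu} = (1-k)\left(\frac{\mu_0}{\mu} - 1\right) \quad (\mu \neq 0).$$

Further, the frequentist coverage probability of the $1-\alpha$ HPDR for $\mu$ is
$$C_\mu = P\{\mu \in (\mu_* \pm z_{\alpha/2}\sigma_*) \mid \mu\}$$
$$= P(\mu_* - z_{\alpha/2}\sigma_* < \mu < \mu_* + z_{\alpha/2}\sigma_* \mid \mu)$$
$$= P(\mu_* - z_{\alpha/2}\sigma_* < \mu, \ \mu < \mu_* + z_{\alpha/2}\sigma_* \mid \mu)$$
$$= P\left((1-k)\mu_0 + k\bar{y} - z_{\alpha/2}\sigma_* < \mu, \ \mu < (1-k)\mu_0 + k\bar{y} + z_{\alpha/2}\sigma_* \mid \mu\right)$$
$$= P\left(\bar{y} < \frac{\mu - (1-k)\mu_0 + z_{\alpha/2}\sigma_*}{k}, \ \frac{\mu - (1-k)\mu_0 - z_{\alpha/2}\sigma_*}{k} < \bar{y} \,\Big|\, \mu\right)$$
$$= P(\bar{y} < b(\mu), \ a(\mu) < \bar{y} \mid \mu),$$

where:
$$b(\mu) = \frac{\mu - (1-k)\mu_0 + z_{\alpha/2}\sigma_*}{k}$$
$$a(\mu) = \frac{\mu - (1-k)\mu_0 - z_{\alpha/2}\sigma_*}{k}.$$

Thus, we find that
$$C_\mu = P(a(\mu) < \bar{y} < b(\mu) \mid \mu)$$
$$= P\left(\frac{a(\mu) - \mu}{\sigma/\sqrt{n}} < \frac{\bar{y} - \mu}{\sigma/\sqrt{n}} < \frac{b(\mu) - \mu}{\sigma/\sqrt{n}} \,\Big|\, \mu\right)$$
$$= P\left(\frac{a(\mu) - \mu}{\sigma/\sqrt{n}} < Z < \frac{b(\mu) - \mu}{\sigma/\sqrt{n}}\right) \quad \text{where } Z \sim N(0,1), \text{ since } \left(\frac{\bar{y} - \mu}{\sigma/\sqrt{n}} \,\Big|\, \mu\right) \sim N(0,1)$$
$$= \Phi\left(\frac{b(\mu) - \mu}{\sigma/\sqrt{n}}\right) - \Phi\left(\frac{a(\mu) - \mu}{\sigma/\sqrt{n}}\right).$$

Note: Here, $\Phi$ denotes the standard normal cdf.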
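As a check on the closed-form coverage formula, the R sketch below compares it with a Monte Carlo estimate at a single value of the true mean under an informative prior. The particular values of mu, mu0, sig, sig0 and n, the number of replications J and the seed are assumptions made for illustration only.

```r
set.seed(1)                        # arbitrary seed, for reproducibility
mu = 1.5; mu0 = 1; sig = 1; sig0 = 0.5; n = 10; alp = 0.05; J = 20000
k = n/(n + (sig/sig0)^2); sigstar = sig*sqrt(k/n); z = qnorm(1-alp/2)

# Closed-form frequentist coverage probability of the HPDR
a = (mu - (1-k)*mu0 - z*sigstar)/k
b = (mu - (1-k)*mu0 + z*sigstar)/k
Cmu = pnorm((b-mu)/(sig/sqrt(n))) - pnorm((a-mu)/(sig/sqrt(n)))

# Monte Carlo estimate: simulate ybar, form the HPDR, and count how often it covers mu
ybar = rnorm(J, mu, sig/sqrt(n))
mus = (1-k)*mu0 + k*ybar
Chat = mean(mus - z*sigstar < mu & mu < mus + z*sigstar)

c(Cmu, Chat)                       # the two values should be close
```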
Figures 2.1, 2.2 and 2.3 show $B_\mu$, $R_\mu$ and $C_\mu$ for selected values of $\sigma_0$, with $n = 10$, $\mu_0 = 1$, $\sigma = 1$ and $\alpha = 0.05$ in each case. The strength of the prior belief is represented by $\sigma_0$, with large values of this parameter indicating relative ignorance.

In Figure 2.1, we see that, for any given value of $\mu$, the frequentist bias $B_\mu$ of the posterior mean $\mu_* = E(\mu \mid y)$ converges to zero as the prior belief tends to total ignorance, that is, in the limit as $\sigma_0 \to \infty$.

Also, $B_\mu \to \mu_0 - \mu$ as the prior belief tends to complete certainty, that is, in the limit as $\sigma_0 \to 0$.

Note: One of the thin dotted guidelines in Figure 2.1 shows the function $B_\mu = \mu_0 - \mu$ in this latter extreme case of 'absolute' prior belief that $\mu = \mu_0$. In all of the examples, $\mu_0 = 1$.

In Figure 2.2, we see that, for any given value of $\mu$, the frequentist relative bias $R_\mu$ of the posterior mean $\mu_* = E(\mu \mid y)$ converges to zero as $\sigma_0 \to \infty$. Also, $R_\mu \to (\mu_0/\mu) - 1$ as $\sigma_0 \to 0$.

Note: The curved thin dotted guidelines in Figure 2.2 show the function $R_\mu = (\mu_0/\mu) - 1$ in this latter extreme case of 'absolute' prior belief that $\mu = \mu_0$.

In Figure 2.3, we see that, for any given value of $\mu$, the frequentist coverage probability $C_\mu$ of the $1-\alpha$ (i.e. 0.95 or 95%) HPDR, namely $(\mu_* \pm z_{\alpha/2}\sigma_*)$, converges to $1-\alpha$ as $\sigma_0 \to \infty$.

Also, $C_\mu \to 0$ as $\sigma_0 \to 0$, except at exactly $\mu = \mu_0$, where $C_\mu \to 1$; thus, $C_\mu \to I(\mu = \mu_0)$ as $\sigma_0 \to 0$ (where $I$ denotes the standard indicator function).

Note: In Figure 2.3, the thin dotted horizontal guidelines show the values 0, 0.95 and 1.
Figure 2.1 Frequentist bias in Exercise 2.1

Figure 2.2 Frequentist relative bias in Exercise 2.1

Figure 2.3 Frequentist coverage probability in Exercise 2.1

[Each figure plots the relevant function against mu, with one curve for each of sig0 = 0.1, 0.2, 0.5 and 1.0.]

R Code for Exercise 2.1 


# Frequentist bias of the posterior mean: (1-k)*(mu0-mu)
biasfun = function(mu,n,sig,mu0,sig0){
   k = n/(n+(sig/sig0)^2)
   (1-k)*mu0-mu*(1-k) }

# Frequentist coverage probability of the 1-alp HPDR
coverfun = function(mu,n,sig,mu0,sig0,alp=0.05){
   k = n/(n + (sig/sig0)^2)
   sigstar = sig*sqrt(k/n); z=qnorm(1-alp/2)
   a=( mu-(1-k)*mu0-z*sigstar )/ k
   b=( mu-(1-k)*mu0+z*sigstar )/ k
   u= pnorm((b-mu)/(sig/sqrt(n)))
   l= pnorm((a-mu)/(sig/sqrt(n)))
   u-l }

X11(w=8,h=5.5); par(mfrow=c(1,1))
muvec=seq(-5,5,0.01); mu0=1; sig=1; n=10; sig0v=c(0.1,0.2,0.5,1)

# Figure 2.1: frequentist bias
plot(c(-2,2),c(-1,3),type="n",xlab="mu",ylab="",main=" ")
abline(1,-1,lty=3); abline(v=0,lty=3); abline(h=0,lty=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[1]),lty=1,lwd=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[2]),lty=2,lwd=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[3]),lty=3,lwd=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[4]),lty=4,lwd=3)
legend(1,2.8,c("sig0=0.1","sig0=0.2","sig0=0.5","sig0=1.0"),
   lty=1:4,lwd=rep(3,4))

# Figure 2.2: frequentist relative bias
plot(c(-2,2),c(-2,4),type="n",xlab="mu",ylab="",main=" ")
abline(v=0,lty=3); abline(h=0,lty=3); lines(muvec,mu0/muvec-1,lty=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[1])/muvec,lty=1,lwd=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[2])/muvec,lty=2,lwd=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[3])/muvec,lty=3,lwd=3)
lines(muvec,biasfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[4])/muvec,lty=4,lwd=3)
legend(-2,4,c("sig0=0.1","sig0=0.2","sig0=0.5","sig0=1.0"),
   lty=1:4,lwd=rep(3,4))

# Figure 2.3: frequentist coverage probability
plot(c(-1,3),c(0,1),type="n",xlab="mu",ylab="",main=" ")
abline(h=c(0,0.95,1),lty=3)
lines(muvec,coverfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[1]),lty=1,lwd=3)
lines(muvec,coverfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[2]),lty=2,lwd=3)
lines(muvec,coverfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[3]),lty=3,lwd=3)
lines(muvec,coverfun(mu=muvec,n=n,sig=sig,mu0=mu0,sig0=sig0v[4]),lty=4,lwd=3)
legend(-0.55,0.6,c("sig0=0.1","sig0=0.2","sig0=0.5","sig0=1.0"),
   lty=1:4,lwd=rep(3,4))


Exercise 2.2 Frequentist characteristics of estimators in the 
normal-gamma model 


Consider the normal-gamma model given by:
$$(y_1,\ldots,y_n \mid \lambda) \sim \text{iid } N(\mu, 1/\lambda)$$
$$\lambda \sim \text{Gamma}(\alpha, \beta).$$

(a) Work out general formulae for the frequentist bias and relative bias of the posterior mean of $\sigma^2 = 1/\lambda$, and for the frequentist coverage probability of the $1-A$ CPDR for $\sigma^2$.

Produce graphs showing examples of each of these three functions.

(b) Attempt to find a single prior under this model (that is, a single suitable pair of values $\alpha$, $\beta$) which results in both:
(i) a Bayesian posterior mean of $\sigma^2$ that is unbiased (in the frequentist sense) for all possible values of $\sigma^2$; and
(ii) a CPDR for $\sigma^2$ that has frequentist coverage probabilities exactly equal to the desired coverage for all possible values of $\sigma^2$.

Solution to Exercise 2.2


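(a) Recall that the posterior mean of $\sigma^2$ is
$$\hat\sigma^2 = E(\sigma^2 \mid y) = \frac{b}{a-1},$$
where: $a = \alpha + \frac{n}{2}$, $b = \beta + \frac{n}{2}s_\mu^2$, $s_\mu^2 = \frac{1}{n}\sum_{i=1}^n (y_i-\mu)^2$.

That is,
$$\hat\sigma^2 = \frac{\beta + (n/2)s_\mu^2}{\alpha + (n/2) - 1} = \frac{2\beta + n s_\mu^2}{2\alpha + n - 2}.$$

So the frequentist bias of $\hat\sigma^2$ is
$$B_{\sigma^2} = E(\hat\sigma^2 - \sigma^2 \mid \sigma^2) = \frac{2\beta + nE(s_\mu^2 \mid \sigma^2)}{2\alpha+n-2} - \sigma^2 = \frac{2\beta + n\sigma^2}{2\alpha+n-2} - \sigma^2.$$

Note: This follows because, conditional on $\sigma^2$, it is true that
$$\frac{n s_\mu^2}{\sigma^2} = \sum_{i=1}^n \left(\frac{y_i-\mu}{\sigma}\right)^2 \sim \chi^2(n) \quad \text{(with mean } n\text{)}.$$

Therefore the frequentist relative bias of $\hat\sigma^2$ is
$$R_{\sigma^2} = \frac{B_{\sigma^2}}{\sigma^2} = \frac{(2\beta/\sigma^2) + n}{2\alpha+n-2} - 1.$$

Note: We see that for any fixed $\sigma^2$, $\alpha$ and $\beta$ it is true that $B_{\sigma^2}, R_{\sigma^2} \to 0$ as $n \to \infty$.

Thus the posterior mean of $\sigma^2$ is asymptotically unbiased, in the frequentist sense.

Next, recall that the $1-A$ CPDR for $\sigma^2 = 1/\lambda$ is
$$I = \left(\frac{2\beta+ns_\mu^2}{v},\ \frac{2\beta+ns_\mu^2}{u}\right),$$
where $u = F^{-1}_{\chi^2(2\alpha+n)}(A/2)$ and $v = F^{-1}_{\chi^2(2\alpha+n)}(1-A/2)$ are the lower and upper $A/2$ quantiles of the $\chi^2(2\alpha+n)$ distribution.

So the frequentist coverage probability of $I$ is
$$C_{\sigma^2} = P(\sigma^2 \in I \mid \sigma^2) = P\left(\frac{2\beta+ns_\mu^2}{v} < \sigma^2 < \frac{2\beta+ns_\mu^2}{u} \,\Big|\, \sigma^2\right)$$
$$= P\left(u - \frac{2\beta}{\sigma^2} < \frac{ns_\mu^2}{\sigma^2} < v - \frac{2\beta}{\sigma^2} \,\Big|\, \sigma^2\right) = F_{\chi^2(n)}\left(v - \frac{2\beta}{\sigma^2}\right) - F_{\chi^2(n)}\left(u - \frac{2\beta}{\sigma^2}\right).$$

Figures 2.4, 2.5 and 2.6 (pages 72 and 73) show $B_{\sigma^2}$, $R_{\sigma^2}$ and $C_{\sigma^2}$ for selected values of $\alpha$ and $\beta$, with $n = 10$ and $A = 0.05$ in each case.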


(b) Observe that under the prior given by $\alpha = 1$ and $\beta = 0$ (that is, $f(\lambda) \propto \lambda^{1-1}e^{-0\lambda} \propto 1$), it is true that:
• the posterior mean of $\sigma^2$ equals the MLE, namely $s_\mu^2$, and so is unbiased
• the $1-A$ CPDR for $\sigma^2$ is
$$\left(\frac{ns_\mu^2}{\chi^2_{A/2}(n+2)},\ \frac{ns_\mu^2}{\chi^2_{1-A/2}(n+2)}\right),$$
which has coverage probability less than $1-A$ for all $\sigma^2$.

(Here $\chi^2_p(m)$ denotes the upper $p$-quantile of the $\chi^2(m)$ distribution.)

Also, under the prior given by $\alpha = \beta = 0$ (i.e. $f(\lambda) \propto \lambda^{0-1}e^{-0\lambda} \propto 1/\lambda$), it is true that:
• the posterior mean of $\sigma^2$ equals $s_\mu^2/(1-2/n)$ and so is biased
• the $1-A$ CPDR for $\sigma^2$ is the same as the classical CI, namely
$$\left(\frac{ns_\mu^2}{\chi^2_{A/2}(n)},\ \frac{ns_\mu^2}{\chi^2_{1-A/2}(n)}\right),$$
and so has coverage exactly $1-A$ for all $\sigma^2$.

We see that there is no single gamma prior for $\lambda = 1/\sigma^2$ which results in both:
(i) a Bayesian posterior mean of $\sigma^2$ that is unbiased (in the frequentist sense) for all possible values of $\sigma^2$; and
(ii) a CPDR for $\sigma^2$ that has frequentist coverage probabilities exactly equal to the desired coverage for all possible values of $\sigma^2$.

Note: It is easy to modify or 'correct' the posterior mean under $\alpha = \beta = 0$ so that it becomes unbiased. Explicitly, if $\alpha = \beta = 0$, then
$$E(\hat\sigma^2 \mid \sigma^2) = \frac{n\sigma^2}{n-2}.$$

So an unbiased estimate of $\sigma^2$ is
$$\tilde\sigma^2 = \frac{n-2}{n}\hat\sigma^2 = \frac{n-2}{n} \times \frac{0 + (n/2)s_\mu^2}{0 + (n/2) - 1} = s_\mu^2 \quad \text{(i.e. the MLE)}.$$

Figure 2.4 Frequentist bias in Exercise 2.2

Figure 2.5 Frequentist relative bias in Exercise 2.2

Figure 2.6 Frequentist coverage probability in Exercise 2.2
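The conclusion of part (b) can be illustrated numerically. The sketch below evaluates the bias and coverage formulae from part (a) under the two priors discussed; the function names `bias_sig2` and `cover_sig2` and the chosen grid of $\sigma^2$ values are illustrative assumptions, not part of the original listing.

```r
# Bias of the posterior mean of sigma^2, and coverage of the 1-A CPDR,
# under a Gamma(alp, bet) prior on lambda = 1/sigma^2 (names are illustrative).
bias_sig2 <- function(sig2, n, alp, bet) (2 * bet + n * sig2) / (2 * alp + n - 2) - sig2
cover_sig2 <- function(sig2, n, alp, bet, A = 0.05) {
  u <- qchisq(A / 2, 2 * alp + n)        # lower A/2 quantile of chi-square(2*alp + n)
  v <- qchisq(1 - A / 2, 2 * alp + n)    # upper A/2 quantile
  pchisq(v - 2 * bet / sig2, n) - pchisq(u - 2 * bet / sig2, n)
}

sig2 <- c(0.5, 1, 2)   # assumed test values of sigma^2
n <- 10
res <- rbind(
  bias_flat  = bias_sig2(sig2, n, alp = 1, bet = 0),   # prior f(lambda) proportional to 1
  cover_flat = cover_sig2(sig2, n, alp = 1, bet = 0),
  bias_inv   = bias_sig2(sig2, n, alp = 0, bet = 0),   # prior f(lambda) proportional to 1/lambda
  cover_inv  = cover_sig2(sig2, n, alp = 0, bet = 0)
)
colnames(res) <- paste0("sig2=", sig2)
print(round(res, 4))
```

Under the flat prior the bias is zero but the coverage falls short of 0.95, whereas under the $1/\lambda$ prior the coverage is 0.95 but the bias is positive, which is the trade-off described in the text.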
R Code for Exercise 2.2

biasfun = function(sig2,n=10,alp=0,bet=0){ (2*bet+n*sig2)/(2*alp+n-2)-sig2 }

coverfun = function(sig2,n=10,alp=0,bet=0,A=0.05){
  u = qchisq(A/2,2*alp+n); v = qchisq(1-A/2,2*alp+n)
  pchisq(v-2*bet/sig2, n) - pchisq(u-2*bet/sig2, n) }

X11(w=8,h=5.5); par(mfrow=c(1,1))
sig2vec=seq(0.01,5,0.01); n=10; alpv=c(0.1,1,5); betv=c(0.1,1,5)
leg=c("alp=0, bet=0","alp=0, bet=1","alp=1, bet=0","alp=1, bet=1")

# Frequentist bias
plot(c(0,5),c(-2,1),type="n",xlab="sigma^2",ylab="",main=" ")
abline(h=0,lty=3)
lines(sig2vec, biasfun(sig2=sig2vec,alp=0,bet=0), lty=1,lwd=3)
lines(sig2vec, biasfun(sig2=sig2vec,alp=0,bet=1), lty=2,lwd=3)
lines(sig2vec, biasfun(sig2=sig2vec,alp=1,bet=0), lty=3,lwd=3)
lines(sig2vec, biasfun(sig2=sig2vec,alp=1,bet=1), lty=4,lwd=3)
legend(0,-0.5,leg,lty=1:4,lwd=rep(3,4))

# Frequentist relative bias
plot(c(0,3),c(-1,6),type="n",xlab="sigma^2",ylab="",main=" ")
abline(h=0,lty=3); abline(v=0,lty=3)
lines(sig2vec, biasfun(sig2=sig2vec,alp=0,bet=0)/sig2vec, lty=1,lwd=3)
lines(sig2vec, biasfun(sig2=sig2vec,alp=0,bet=1)/sig2vec, lty=2,lwd=3)
lines(sig2vec, biasfun(sig2=sig2vec,alp=1,bet=0)/sig2vec, lty=3,lwd=3)
lines(sig2vec, biasfun(sig2=sig2vec,alp=1,bet=1)/sig2vec, lty=4,lwd=3)
legend(1.5,6,leg,lty=1:4,lwd=rep(3,4))

# Frequentist coverage probability
plot(c(0,2),c(0,1),type="n",xlab="sigma^2",ylab="",main=" ")
abline(h=c(0,0.95,1),lty=3)
lines(sig2vec, coverfun(sig2=sig2vec,n=10,alp=0,bet=0,A=0.05), lty=1,lwd=3)
lines(sig2vec, coverfun(sig2=sig2vec,n=10,alp=0,bet=1,A=0.05), lty=2,lwd=3)
lines(sig2vec, coverfun(sig2=sig2vec,n=10,alp=1,bet=0,A=0.05), lty=3,lwd=3)
lines(sig2vec, coverfun(sig2=sig2vec,n=10,alp=1,bet=1,A=0.05), lty=4,lwd=3)
legend(1,0.6,leg,lty=1:4,lwd=rep(3,4))


2.2 Mixture prior distributions 


So far we have considered Bayesian models with priors that are limited 
in the types of prior information that they can represent. For example, 
the normal-normal model does not allow a prior for the normal mean 
which has two or more modes. If a non-normal class of prior is used to 
represent one’s complicated prior beliefs regarding the normal mean, 
then that prior will not be conjugate, and this will lead to difficulties 
down the track when making inferences based on the nonstandard 
posterior distribution. 


Fortunately, this problem can be addressed in any Bayesian model for 
which a conjugate class of prior exists by specifying the prior as a 
mixture of members of that class. 


Generally, a random variable $X$ with a mixture distribution has a density of the form
$$f(x) = \sum_{m=1}^{M} c_m f_m(x),$$
where each $f_m(x)$ is a proper density and the $c_m$ values are positive and sum to 1.

If our prior beliefs regarding a parameter $\theta$ do not follow any single well-known distribution, those beliefs can be conveniently approximated to any degree of precision by a suitable mixture prior distribution with a density of the form
$$f(\theta) = \sum_{m=1}^{M} c_m f_m(\theta).$$

It can be shown (see Exercise 2.3 below) that if each component prior $f_m(\theta)$ is conjugate then $f(\theta)$ is also conjugate. This means that $\theta$'s posterior distribution is also a mixture, with density of the form
$$f(\theta \mid y) = \sum_{m=1}^{M} c'_m f_m(\theta \mid y), \qquad (2.1)$$
where $f_m(\theta \mid y)$ is the posterior implied by the $m$th prior $f_m(\theta)$ and is from the same family of distributions as that prior, and where the $c'_m$ are the updated mixture weights.


Exercise 2.3 Binomial-beta model with a mixture prior 


(a) Consider the following Bayesian model:
$$(y \mid \theta) \sim \text{Bin}(n, \theta)$$
$$f(\theta) = k f_{\text{Beta}(a_1,b_1)}(\theta) + (1-k) f_{\text{Beta}(a_2,b_2)}(\theta),$$
where $n$, $k$ and the $a_i$, $b_i$ are specified constants.

Note: Here, $f_{\text{Beta}(a,b)}(t)$ denotes the density at $t$ of the beta distribution with parameters $a$ and $b$ (and mean $a/(a+b)$).

Find the posterior distribution of $\theta$ and show that $\theta$'s prior is conjugate. Then create a figure showing the prior, likelihood and posterior for the situation defined by:

$$n = 5,\ k = 3/4,\ a_1 = 8,\ b_1 = 25,\ a_2 = 20,\ b_2 = 20 \text{ and } y = 4.$$

Also calculate the prior mean of $\theta$, the posterior mean of $\theta$ and the MLE of $\theta$. Then mark these three points in the figure.

(b) Show that any mixture of conjugate priors is also conjugate and derive a general formula which could be used to calculate the mixture weights $c'_m$ in (2.1) above.

Solution to Exercise 2.3

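(a) The posterior density is
$$f(\theta \mid y) \propto f(\theta) f(y \mid \theta)$$
$$\propto \left[ k\,\frac{\theta^{a_1-1}(1-\theta)^{b_1-1}}{B(a_1,b_1)} + (1-k)\,\frac{\theta^{a_2-1}(1-\theta)^{b_2-1}}{B(a_2,b_2)} \right] \times \theta^{y}(1-\theta)^{n-y}$$
$$= k\,\frac{B(a_1+y,\,b_1+n-y)}{B(a_1,b_1)} \times \frac{\theta^{(a_1+y)-1}(1-\theta)^{(b_1+n-y)-1}}{B(a_1+y,\,b_1+n-y)}$$
$$\quad + (1-k)\,\frac{B(a_2+y,\,b_2+n-y)}{B(a_2,b_2)} \times \frac{\theta^{(a_2+y)-1}(1-\theta)^{(b_2+n-y)-1}}{B(a_2+y,\,b_2+n-y)}.$$

Thus
$$f(\theta \mid y) \propto c_1 f_1(\theta \mid y) + c_2 f_2(\theta \mid y),$$
where:
$$c_1 = k\,\frac{B(a_1+y,\,b_1+n-y)}{B(a_1,b_1)}, \qquad c_2 = (1-k)\,\frac{B(a_2+y,\,b_2+n-y)}{B(a_2,b_2)},$$
$$f_i(\theta \mid y) = f_{\text{Beta}(a_i+y,\,b_i+n-y)}(\theta) = \frac{\theta^{(a_i+y)-1}(1-\theta)^{(b_i+n-y)-1}}{B(a_i+y,\,b_i+n-y)},\quad 0<\theta<1$$
(the posterior density corresponding to $\theta \sim \text{Beta}(a_i,b_i)$ as prior).

Now,
$$\int f_i(\theta \mid y)\,d\theta = 1,$$
and so
$$f(\theta \mid y) = c\,f_{\text{Beta}(a_1+y,\,b_1+n-y)}(\theta) + (1-c)\,f_{\text{Beta}(a_2+y,\,b_2+n-y)}(\theta),$$
where
$$c = \frac{c_1}{c_1+c_2}.$$

Note: This ensures that $\int f(\theta \mid y)\,d\theta = c \times 1 + (1-c) \times 1 = 1$.

We see that the prior $f(\theta)$ and posterior $f(\theta \mid y)$ are in the same family, namely the family of mixtures of two beta distributions. Therefore the mixture prior is conjugate.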


For the situation where
$$n = 5,\ k = 3/4,\ a_1 = 8,\ b_1 = 25,\ a_2 = 20,\ b_2 = 20 \text{ and } y = 4,$$
we find that:

• the prior mean is
$$E\theta = k\left(\frac{a_1}{a_1+b_1}\right) + (1-k)\left(\frac{a_2}{a_2+b_2}\right) = 0.3068$$

• the maximum likelihood estimate is
$$y/n = 0.8$$

• the posterior mean is
$$E(\theta \mid y) = c\left(\frac{a_1+y}{a_1+b_1+n}\right) + (1-c)\left(\frac{a_2+y}{a_2+b_2+n}\right) = 0.4772.$$

Figure 2.7 shows the prior density $f(\theta)$, the likelihood function $L(\theta)$, and the posterior density $f(\theta \mid y)$, as well as the prior mean, the MLE and the posterior mean.

Note: The likelihood function in Figure 2.7 has been normalised so that the area underneath it is exactly 1. This means that this likelihood function is identical to the posterior density under the standard uniform prior, i.e. under $f_{U(0,1)}(\theta) = f_{\text{Beta}(1,1)}(\theta)$. Thus, $L(\theta) = f_{\text{Beta}(1+y,\,1+n-y)}(\theta)$.

Figure 2.7 also shows the two component prior densities and the two component posterior densities. It may be observed that, whereas the lower component prior has the higher weight, 0.75, the opposite is the case regarding the component posteriors. For these, the weight associated with the lower posterior is only 0.2583. This is because the inference is being 'pulled up' in the direction of the likelihood (with the posterior mean being between the prior mean and the MLE, 0.8).

Figure 2.7 Densities and likelihood in Exercise 2.3
(Axes: density/likelihood against theta. Legend: prior, likelihood, posterior, component priors, component posteriors; points mark the prior mean, the MLE and the posterior mean.)


(b) Suppose that $\theta$ has a mixture prior of the general form
$$f(\theta) = \sum_{m=1}^{M} c_m f_m(\theta),$$
where each $f_m(\theta)$ is conjugate for the data model.

Then the posterior density is
$$f(\theta \mid y) \propto f(\theta) f(y \mid \theta) = \sum_{m=1}^{M} c_m f_m(\theta) f(y \mid \theta) = \sum_{m=1}^{M} c_m f_m(y)\left\{\frac{f_m(\theta) f(y \mid \theta)}{f_m(y)}\right\},$$
where $f_m(y) = \int f_m(\theta) f(y \mid \theta)\,d\theta$ is the unconditional density of the data under the $m$th prior, $f_m(\theta)$. Thus
$$f(\theta \mid y) \propto \sum_{m=1}^{M} k_m f_m(\theta \mid y),$$
where
$$k_m = c_m f_m(y) \quad \text{and} \quad f_m(\theta \mid y) = \frac{f_m(\theta) f(y \mid \theta)}{f_m(y)}$$
is the posterior density of $\theta$ under the $m$th prior, $f_m(\theta)$.

It follows that
$$f(\theta \mid y) = \sum_{m=1}^{M} c'_m f_m(\theta \mid y),$$
where $c'_m = k_m/(k_1 + \ldots + k_M)$.

Thus $\theta$'s posterior is a mixture of distributions from the same families to which the components of $\theta$'s mixture prior belong, respectively. This shows that $\theta$'s mixture prior is conjugate. Note that the component prior distributions can be from different classes, so long as each is conjugate in relation to its own class.
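The general weight formula $c'_m = k_m/(k_1+\ldots+k_M)$ with $k_m = c_m f_m(y)$ can be sketched for a beta mixture prior with any number of components. The helper `mix_beta_posterior` below is a hypothetical name introduced for illustration; it works on the log scale (an implementation choice, not something from the text) and drops the binomial coefficient, which is common to all components and cancels.

```r
# Posterior weights and component parameters for a mixture-of-betas prior
# with binomial data (illustrative helper, not the book's code).
mix_beta_posterior <- function(w, a, b, y, n) {
  # log k_m = log c_m + log{ B(a_m + y, b_m + n - y) / B(a_m, b_m) }
  logk <- log(w) + lbeta(a + y, b + n - y) - lbeta(a, b)
  k <- exp(logk - max(logk))              # rescale for numerical stability
  list(weights = k / sum(k), a = a + y, b = b + n - y)
}

# Values from Exercise 2.3(a)
post <- mix_beta_posterior(w = c(3/4, 1/4), a = c(8, 20), b = c(25, 20), y = 4, n = 5)
post_mean <- sum(post$weights * post$a / (post$a + post$b))
print(round(post$weights, 4))
print(round(post_mean, 4))
```

With the values of Exercise 2.3(a), the first weight and the posterior mean should agree with the figures 0.2583 and 0.4772 quoted in the solution.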


R Code for Exercise 2.3 


n=5; k=3/4; a1=8; b1=25; a2=20; b2=20; y=4; thetav=seq(0,1,0.01)
prior1=dbeta(thetav,a1,b1); prior2=dbeta(thetav,a2,b2)
post1=dbeta(thetav,a1+y,b1+n-y); post2=dbeta(thetav,a2+y,b2+n-y)
prior = k*prior1 + (1-k)*prior2

c1=k*beta(a1+y,b1+n-y)/beta(a1,b1); c2=(1-k)*beta(a2+y,b2+n-y)/beta(a2,b2)
c=c1/(c1+c2); post=c*post1 + (1-c)*post2; options(digits=4); c  # 0.2583
like=dbeta(thetav,1+y,1+n-y)  # likelihood = post. under U(0,1)=beta(1,1) prior

X11(w=8,h=5.5)
plot(c(0,1),c(0,8),type="n",xlab="theta",ylab="density/likelihood")
lines(thetav,prior,lty=1,lwd=4)
lines(thetav,like,lty=2,lwd=4)
lines(thetav,post,lty=3,lwd=4)
legend(0,8,c("Prior","Likelihood","Posterior"),lty=c(1,2,3),lwd=c(4,4,4))
lines(thetav,prior1,lty=1,lwd=2)
lines(thetav,prior2,lty=1,lwd=2)
lines(thetav,post1,lty=3,lwd=2)
lines(thetav,post2,lty=3,lwd=2)
legend(0.3,8,c("Component priors","Component posteriors"),
  lty=c(1,3),lwd=c(2,2))

mle=y/n; priormean=k*a1/(a1+b1)+(1-k)*a2/(a2+b2)
postmean=c*(a1+y)/(a1+b1+n) + (1-c)*(a2+y)/(a2+b2+n)
points(c(priormean,mle,postmean),c(0,0,0),pch=c(1,2,4),cex=c(1.5,1.5,1.5),
  lwd=c(2,2,2))
c(priormean,mle,postmean)  # 0.3068 0.8000 0.4772
legend(0.7,8,c(" Prior mean"," MLE"," Posterior mean"),
  pch=c(1,2,4),pt.cex=c(1.5,1.5,1.5),pt.lwd=c(2,2,2))

2.3 Dealing with a priori ignorance


The Bayesian approach requires a prior distribution to be specified even 
when there is complete (or total) a priori ignorance (meaning no prior 
information at all). This feature presents a general and philosophical 
problem with the Bayesian paradigm, one for which several theoretical 
solutions have been advanced but which does not yet have a universally 
accepted solution. We have already discussed finding an uninformative 
prior in relation to particular Bayesian models, as follows. 


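For the normal-normal model defined by $(y_1,\ldots,y_n \mid \mu) \sim \text{iid } N(\mu,\sigma^2)$ and $\mu \sim N(\mu_0,\sigma_0^2)$, an uninformative prior is given by $\sigma_0 = \infty$, that is, $f(\mu) \propto 1$, $\mu \in \mathbb{R}$.

For the normal-gamma model defined by $(y_1,\ldots,y_n \mid \lambda) \sim \text{iid } N(\mu,1/\lambda)$ and $\lambda \sim \text{Gamma}(\alpha,\beta)$, an uninformative prior is given by $\alpha = \beta = 0$, that is, $f(\lambda) \propto 1/\lambda$, $\lambda > 0$.

For the binomial-beta model defined by $(y \mid \theta) \sim \text{Binomial}(n,\theta)$ and $\theta \sim \text{Beta}(\alpha,\beta)$ (having the posterior $(\theta \mid y) \sim \text{Beta}(\alpha+y,\ \beta+n-y)$), an uninformative prior is the Bayes prior given by $\alpha = \beta = 1$, that is, $f(\theta) = 1$, $0<\theta<1$. This is the prior that was originally advocated by Thomas Bayes.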


Unlike for the normal-normal and normal-gamma models, more than one 
uninformative prior specification has been proposed as reasonable in the 
context of the binomial-beta model. 


One of these is the improper Haldane prior, defined by $\alpha = \beta = 0$, or
$$f(\theta) \propto \frac{1}{\theta(1-\theta)},\quad 0<\theta<1.$$

Under the prior $\theta \sim \text{Beta}(\alpha,\beta)$ generally, the posterior mean of $\theta$ is
$$\hat\theta = E(\theta \mid y) = \frac{(\alpha+y)}{(\alpha+y)+(\beta+n-y)} = \frac{\alpha+y}{\alpha+\beta+n}.$$

This reduces to the MLE $y/n$ under the Haldane prior but not under the Bayes prior. In contrast, the Bayes prior leads to a posterior mode which is equal to the MLE.

The Haldane prior may be considered as being most appropriate for allowing the data to 'speak for itself' in cases of a priori ignorance.

However, the Haldane prior leads to an improper and degenerate posterior if the data $y$ happens to be either 0 or $n$. Specifically:
$y = 0 \Rightarrow (\theta \mid y) \sim \text{Beta}(0,n)$, or equivalently, $P(\theta = 0 \mid y) = 1$
$y = n \Rightarrow (\theta \mid y) \sim \text{Beta}(n,0)$, or equivalently, $P(\theta = 1 \mid y) = 1$.


So in each case, point estimation is possible but not interval estimation. 


No such problems occur using the Bayes prior. This is because that prior 
is proper and so cannot lead to an improper posterior, whatever the data 
may be. Interestingly, there is a third choice which provides a kind of 
compromise between the Bayes and Haldane priors, as described below. 
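The contrast between the Bayes and Haldane priors can be seen in a small calculation. The sketch below uses the illustrative data y = 4, n = 5 (an assumption made for the example) and two hypothetical helpers, `post_mean` and `post_mode`, based on the standard mean and interior mode of a beta distribution.

```r
# Posterior mean and mode of theta under a Beta(alp, bet) prior with binomial data.
# (Helper names and the data y = 4, n = 5 are assumptions for illustration.)
post_mean <- function(y, n, alp, bet) (alp + y) / (alp + bet + n)
# Interior mode of Beta(alp + y, bet + n - y); valid only if both parameters exceed 1
post_mode <- function(y, n, alp, bet) (alp + y - 1) / (alp + bet + n - 2)

y <- 4; n <- 5
mle <- y / n
haldane_mean <- post_mean(y, n, alp = 0, bet = 0)   # equals the MLE
bayes_mean   <- post_mean(y, n, alp = 1, bet = 1)   # shrunk towards 1/2
bayes_mode   <- post_mode(y, n, alp = 1, bet = 1)   # equals the MLE
print(c(mle = mle, haldane_mean = haldane_mean,
        bayes_mean = bayes_mean, bayes_mode = bayes_mode))
```

The Haldane posterior mean and the Bayes posterior mode both coincide with $y/n$, while the Bayes posterior mean, $(y+1)/(n+2)$, does not.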


2.4 The Jeffreys prior 


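The statistician Harold Jeffreys devised a rule for finding a suitable uninformative prior in a wide variety of situations. His idea was to construct a prior which is invariant under reparameterisation. For the case of a univariate model parameter $\theta$, the Jeffreys prior is given by the following equation (also known as Jeffreys' rule):

$$f(\theta) \propto \sqrt{I(\theta)},$$

where $I(\theta)$ is the Fisher information defined by
$$I(\theta) = E\left\{\left(\frac{\partial}{\partial\theta}\log f(y \mid \theta)\right)^2 \,\Big|\, \theta\right\}.$$

Note 1: If $\log f(y \mid \theta)$ is twice differentiable with respect to $\theta$, and certain regularity conditions hold, then
$$I(\theta) = -E\left\{\frac{\partial^2}{\partial\theta^2}\log f(y \mid \theta) \,\Big|\, \theta\right\}.$$

Note 2: Jeffreys' rule also extends to the multi-parameter case (not considered here).

The significance of Jeffreys' rule may be described as follows. Consider a prior given by $f(\theta) \propto \sqrt{I(\theta)}$ and the transformed parameter $\phi = g(\theta)$, where $g$ is a strictly increasing or decreasing function. (For simplicity, we only consider this case.) Then the prior density for $\phi$ is
$$f(\phi) \propto f(\theta)\left|\frac{\partial\theta}{\partial\phi}\right| \quad \text{(by the transformation rule)}$$
$$\propto \sqrt{I(\theta)\left(\frac{\partial\theta}{\partial\phi}\right)^2} = \sqrt{E\left\{\left(\frac{\partial}{\partial\theta}\log f(y \mid \theta)\right)^2 \Big|\, \theta\right\}\left(\frac{\partial\theta}{\partial\phi}\right)^2}$$
$$= \sqrt{E\left\{\left(\frac{\partial}{\partial\theta}\log f(y \mid \theta)\,\frac{\partial\theta}{\partial\phi}\right)^2 \Big|\, \phi\right\}} = \sqrt{E\left\{\left(\frac{\partial}{\partial\phi}\log f(y \mid \phi)\right)^2 \Big|\, \phi\right\}} = \sqrt{I(\phi)}.$$

Thus, Jeffreys' rule is 'invariant under reparameterisation', in the sense that if a prior is constructed according to

$$f(\theta) \propto \sqrt{I(\theta)},$$

then, for another parameter $\phi = g(\theta)$, it is also true that

$$f(\phi) \propto \sqrt{I(\phi)}.$$

Exercise 2.4 Jeffreys prior for the normal-normal model

Find the Jeffreys prior for $\mu$ if $(y_1,\ldots,y_n \mid \mu) \sim \text{iid } N(\mu,\sigma^2)$, where $\sigma$ is known.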


Solution to Exercise 2.4 


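Here:
$$f(y \mid \mu) = \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{1}{2\sigma^2}(y_i-\mu)^2\right\} \propto \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right\}$$

$$\log f(y \mid \mu) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2 + c \quad \text{(where } c \text{ is a constant)}$$

$$\frac{\partial}{\partial\mu}\log f(y \mid \mu) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}2(y_i-\mu)(-1) = \frac{n}{\sigma^2}(\bar y-\mu)$$

$$\left(\frac{\partial}{\partial\mu}\log f(y \mid \mu)\right)^2 = \frac{n^2}{\sigma^4}(\bar y-\mu)^2.$$

Therefore the Fisher information is
$$I(\mu) = E\left\{\left(\frac{\partial}{\partial\mu}\log f(y \mid \mu)\right)^2 \Big|\, \mu\right\} = \frac{n^2}{\sigma^4}E\{(\bar y-\mu)^2 \mid \mu\} = \frac{n^2}{\sigma^4}V(\bar y \mid \mu) = \frac{n^2}{\sigma^4}\cdot\frac{\sigma^2}{n} = \frac{n}{\sigma^2}.$$

It follows that the Jeffreys prior is $f(\mu) \propto \sqrt{I(\mu)} = \sqrt{n}/\sigma \propto 1$, $\mu \in \mathbb{R}$.

Note 1: This is the same prior as used earlier in the uninformative case.

Note 2: The Fisher information here can also be derived as follows:
$$\frac{\partial^2}{\partial\mu^2}\log f(y \mid \mu) = -\frac{n}{\sigma^2}$$
$$\Rightarrow I(\mu) = -E\left\{\frac{\partial^2}{\partial\mu^2}\log f(y \mid \mu) \,\Big|\, \mu\right\} = -E\left(-\frac{n}{\sigma^2}\right) = \frac{n}{\sigma^2}.$$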


Exercise 2.5 Jeffreys prior for the normal-gamma model 


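Find the Jeffreys prior for $\lambda$ if $(y_1,\ldots,y_n \mid \lambda) \sim \text{iid } N(\mu,1/\lambda)$, where $\mu$ is known.

Solution to Exercise 2.5

Here:
$$f(y \mid \lambda) = \prod_{i=1}^{n}\frac{\lambda^{1/2}}{\sqrt{2\pi}}\exp\left\{-\frac{\lambda}{2}(y_i-\mu)^2\right\} \propto \lambda^{n/2}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right\}$$

$$\log f(y \mid \lambda) = \frac{n}{2}\log\lambda - \frac{\lambda}{2}\sum_{i=1}^{n}(y_i-\mu)^2 + c \quad \text{(where } c \text{ is a constant)}$$

$$\frac{\partial\log f(y \mid \lambda)}{\partial\lambda} = \frac{n}{2\lambda} - \frac{1}{2}\sum_{i=1}^{n}(y_i-\mu)^2 \quad\Rightarrow\quad \frac{\partial^2\log f(y \mid \lambda)}{\partial\lambda^2} = -\frac{n}{2\lambda^2}.$$

So the Fisher information is
$$I(\lambda) = -E\left\{\frac{\partial^2}{\partial\lambda^2}\log f(y \mid \lambda) \,\Big|\, \lambda\right\} = \frac{n}{2\lambda^2}.$$

So the Jeffreys prior is
$$f(\lambda) \propto \sqrt{I(\lambda)} = \sqrt{\frac{n}{2\lambda^2}} \propto \frac{1}{\lambda},\quad \lambda > 0.$$

Note 1: This is the same prior as used earlier in the uninformative case.

Note 2: Another way to obtain the Fisher information is to first write
$$\frac{\partial\log f(y \mid \lambda)}{\partial\lambda} = \frac{n}{2\lambda} - \frac{1}{2}\sum_{i=1}^{n}(y_i-\mu)^2 = \frac{1}{2\lambda}(n-q),$$
where: $q = \lambda\sum_{i=1}^{n}(y_i-\mu)^2$, $(q \mid \lambda) \sim \chi^2(n)$, $E(q \mid \lambda) = n$, $V(q \mid \lambda) = 2n$.

We may then write
$$\left(\frac{\partial\log f(y \mid \lambda)}{\partial\lambda}\right)^2 = \frac{1}{4\lambda^2}(n^2 - 2nq + q^2),$$
and so the Fisher information is
$$I(\lambda) = E\left\{\left(\frac{\partial\log f(y \mid \lambda)}{\partial\lambda}\right)^2 \Big|\, \lambda\right\} = \frac{1}{4\lambda^2}\{n^2 - 2nE(q \mid \lambda) + E(q^2 \mid \lambda)\}$$
$$= \frac{1}{4\lambda^2}\{n^2 - 2n \cdot n + [2n + n^2]\} = \frac{n}{2\lambda^2}.$$
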
Exercise 2.6 Jeffreys prior for the binomial-beta model 


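Find the Jeffreys prior for $\theta$ if $(y \mid \theta) \sim \text{Binomial}(n,\theta)$, where $n$ is known.

Solution to Exercise 2.6

Here:
$$f(y \mid \theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y}$$
$$\log f(y \mid \theta) = \log\binom{n}{y} + y\log\theta + (n-y)\log(1-\theta)$$
$$\frac{\partial}{\partial\theta}\log f(y \mid \theta) = y\theta^{-1} - (n-y)(1-\theta)^{-1}$$
$$\frac{\partial^2}{\partial\theta^2}\log f(y \mid \theta) = -y\theta^{-2} - (n-y)(1-\theta)^{-2}.$$

So the Fisher information is
$$I(\theta) = -E\left\{\frac{\partial^2}{\partial\theta^2}\log f(y \mid \theta) \,\Big|\, \theta\right\} = -E\{-y\theta^{-2} - (n-y)(1-\theta)^{-2} \mid \theta\}$$
$$= (n\theta)\theta^{-2} + (n-n\theta)(1-\theta)^{-2} = n\left(\frac{1}{\theta} + \frac{1}{1-\theta}\right) = n\,\frac{(1-\theta+\theta)}{\theta(1-\theta)} = \frac{n}{\theta(1-\theta)}.$$

It follows that the Jeffreys prior is given by
$$f(\theta) \propto \sqrt{I(\theta)} = \sqrt{\frac{n}{\theta(1-\theta)}} \propto \frac{1}{\sqrt{\theta(1-\theta)}},\quad 0<\theta<1.$$

Note: We may also write the Jeffreys prior density as
$$f(\theta) \propto \theta^{\frac{1}{2}-1}(1-\theta)^{\frac{1}{2}-1},\quad 0<\theta<1.$$

Thus the Jeffreys prior can be specified by writing
$$\theta \sim \text{Beta}(\alpha,\beta)$$
with $\alpha = \beta = 1/2$.

We see that the Jeffreys prior may be thought of as 'half-way' between:

• the Bayes prior, defined by $\alpha = \beta = 1$; and
• the Haldane prior, defined by $\alpha = \beta = 0$.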
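The Fisher information derived in this solution can be verified directly from its definition, since for binomial data the expectation of the squared score is a finite sum. The sketch below is illustrative only; the name `fisher_binom` and the chosen values of theta and n are assumptions.

```r
# Fisher information for Binomial(n, theta), computed from the definition
# E{ (d/dtheta log f(y|theta))^2 | theta } by summing over y = 0, ..., n.
fisher_binom <- function(theta, n) {
  y <- 0:n
  score <- y / theta - (n - y) / (1 - theta)   # derivative of the log-likelihood
  sum(score^2 * dbinom(y, n, theta))
}

theta <- c(0.1, 0.3, 0.5, 0.8)   # assumed test values
n <- 5
numeric_info <- sapply(theta, fisher_binom, n = n)
closed_form  <- n / (theta * (1 - theta))
print(cbind(theta, numeric_info, closed_form))

# sqrt(I(theta)) should be proportional to the Beta(1/2, 1/2) density
ratio <- sqrt(closed_form) / dbeta(theta, 0.5, 0.5)
print(ratio)   # the same constant at every theta
```

The two columns agree, and the constant ratio confirms that $\sqrt{I(\theta)}$ has the shape of the Beta(1/2, 1/2) density.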


Exercise 2.7 Jeffreys prior for the tramcar problem 


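Recall the discussion of the tramcar problem following Exercise 1.6, in relation to the model $(y \mid \theta) \sim DU(1,\ldots,\theta)$. Find the Jeffreys prior for $\theta$.

Solution to Exercise 2.7

Here,
$$f(y \mid \theta) = 1/\theta = \theta^{-1}$$
$$\Rightarrow \log f(y \mid \theta) = -\log\theta$$
$$\Rightarrow \frac{\partial}{\partial\theta}\log f(y \mid \theta) = -\frac{1}{\theta}$$
$$\Rightarrow \left(\frac{\partial}{\partial\theta}\log f(y \mid \theta)\right)^2 = \frac{1}{\theta^2}$$
$$\Rightarrow I(\theta) = E\left\{\left(\frac{\partial}{\partial\theta}\log f(y \mid \theta)\right)^2 \Big|\, \theta\right\} = \frac{1}{\theta^2}.$$

It follows that the Jeffreys prior for $\theta$ is given by
$$f(\theta) \propto \sqrt{I(\theta)} \propto 1/\theta,\quad \theta = 1,2,3,\ldots$$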


2.5 Bayesian decision theory 


The posterior mean, mode and median, as well as other Bayesian point estimates, can all be derived and interpreted using the principles of decision theory. Suppose we wish to choose an estimate of $\theta$ which minimises costs in some sense. To this end, let $L(\hat\theta,\theta)$ denote generally a loss function (LF) associated with an estimate $\hat\theta$.

Note: The estimator $\hat\theta$ is a function of the data $y$ and so could also be written $\hat\theta(y)$. For example, in the context where $(y \mid \theta) \sim \text{Bin}(n,\theta)$, the sample proportion or MLE is the function given by $\hat\theta = \hat\theta(y) = y/n$.

The loss function $L$ represents the cost incurred when the true value $\theta$ is estimated by $\hat\theta$ and usually satisfies the property $L(\theta,\theta) = 0$.

The three most commonly used loss functions are defined as follows:

$L(\hat\theta,\theta) = |\hat\theta-\theta|$, the absolute error loss function (AELF)

$L(\hat\theta,\theta) = (\hat\theta-\theta)^2$, the quadratic error loss function (QELF)

$L(\hat\theta,\theta) = I(\hat\theta \ne \theta) = \begin{cases} 0 & \text{if } \hat\theta = \theta \\ 1 & \text{if } \hat\theta \ne \theta \end{cases}$, the indicator error loss function (IELF), also known as the zero-one loss function (ZOLF) or the all-or-nothing error loss function (ANLF).

Figures 2.8 and 2.9 illustrate these three basic loss functions.

Figure 2.8 The three most important loss functions
(Panels: absolute, quadratic and zero-one loss, each showing $L(\hat\theta,\theta)$.)

Figure 2.9 Alternative representation of the absolute error loss function
(The other two loss functions can be represented similarly)

Given a Bayesian model, loss function and estimator, we would like to quantify what the loss is likely to be. However, this loss depends on $\theta$ and $y$, which complicates things. An idea of the expected loss may be provided by the risk function, defined as the conditional expectation

$$R(\theta) = E\{L(\hat\theta,\theta) \mid \theta\} = \int L(\hat\theta(y),\theta) f(y \mid \theta)\,dy.$$

The risk function $R(\theta)$ provides us with an idea of the expected loss given any particular value of $\theta$. Figure 2.10 illustrates the idea.

Figure 2.10 The idea of a risk function

To obtain the overall expected loss we need to average the risk function over all possible values of $\theta$. This overall expected loss is called the Bayes risk and may be defined as

$$r = EL(\hat\theta,\theta) = EE\{L(\hat\theta,\theta) \mid \theta\} = ER(\theta) = \int R(\theta) f(\theta)\,d\theta.$$

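As a small illustration of these two definitions (not an example from the text), consider the binomial model with the MLE $\hat\theta = y/n$ under quadratic error loss. The sketch below assumes n = 5 and a standard uniform prior; the risk is computed as a finite sum over $y$ and the Bayes risk by numerical integration. The function names are hypothetical.

```r
# Risk function of the MLE y/n under quadratic error loss, for Binomial(n, theta)
risk_mle <- function(theta, n) {
  y <- 0:n
  sum((y / n - theta)^2 * dbinom(y, n, theta))   # E{ (y/n - theta)^2 | theta }
}

n <- 5                                            # assumed sample size
risk_vec <- function(theta) sapply(theta, risk_mle, n = n)

# Bayes risk under an assumed U(0,1) prior: integrate R(theta) * f(theta)
bayes_risk <- integrate(function(t) risk_vec(t) * dunif(t), lower = 0, upper = 1)$value
print(c(risk_at_0.3 = risk_mle(0.3, n), bayes_risk = bayes_risk))
```

Here the risk equals the variance of the sample proportion, $\theta(1-\theta)/n$, so the Bayes risk under the uniform prior is $1/(6n)$, which the numerical result should reproduce.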
Exercise 2.8 Examples of the risk function and Bayes risk 


Consider the normal-normal model:
$$(y_1,\ldots,y_n \mid \mu) \sim \text{iid } N(\mu,\sigma^2)$$
$$\mu \sim N(\mu_0,\sigma_0^2).$$

For each of the following estimators, derive a formula for the risk function under the quadratic error loss function:

(a) $\hat\mu = \bar y = \frac{1}{n}(y_1 + \ldots + y_n)$ (the sample mean)

(b) $\hat\mu = |\bar y|$ (the absolute value of the sample mean).

In each case, use the derived risk function to determine the Bayes risk.

Solution to Exercise 2.8

For both parts of this exercise, the loss function is given by
$$L(\hat\mu,\mu) = (\hat\mu-\mu)^2.$$

(a) If $\hat\mu = \bar y$ then the risk function is
$$R(\mu) = E\{L(\hat\mu,\mu) \mid \mu\} = E\{(\bar y-\mu)^2 \mid \mu\} = V(\bar y \mid \mu) = \sigma^2/n \quad \text{(a constant)}.$$

So the Bayes risk is simply
$$r = ER(\mu) = E(\sigma^2/n) = \sigma^2/n \quad \text{(i.e. the same constant)}.$$


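(b) If $\hat\mu = |\bar y|$ then the risk function is
$$R(\mu) = E\{(|\bar y|-\mu)^2 \mid \mu\} = E\{\bar y^2 - 2\mu|\bar y| + \mu^2 \mid \mu\}$$
$$= E(\bar y^2 \mid \mu) - 2\mu E(|\bar y| \mid \mu) + \mu^2$$
$$= \left[\frac{\sigma^2}{n} + \mu^2\right] - 2\mu m + \mu^2, \quad \text{where } m = E(|\bar y| \mid \mu).$$

Now,
$$m = \int_{-\infty}^{0}(-\bar y) f(\bar y \mid \mu)\,d\bar y + \int_{0}^{\infty}\bar y f(\bar y \mid \mu)\,d\bar y$$
$$= -\int_{-\infty}^{0}\bar y f(\bar y \mid \mu)\,d\bar y + \left[\int_{-\infty}^{\infty}\bar y f(\bar y \mid \mu)\,d\bar y - \int_{-\infty}^{0}\bar y f(\bar y \mid \mu)\,d\bar y\right]$$
$$= \mu - 2I, \quad \text{where } I = \int_{-\infty}^{0}\bar y f(\bar y \mid \mu)\,d\bar y.$$

Here,
$$I = \int_{-\infty}^{-\mu/c}(\mu + cz)\phi(z)\,dz \quad \text{after putting } z = \frac{\bar y-\mu}{c} \text{ with } c = \frac{\sigma}{\sqrt{n}}$$
$$= \mu\int_{-\infty}^{-\mu/c}\phi(z)\,dz + c\int_{-\infty}^{-\mu/c}z\phi(z)\,dz$$
$$= \mu\Phi\left(-\frac{\mu}{c}\right) + cJ, \quad \text{where } J = \int_{-\infty}^{-\mu/c}z\phi(z)\,dz.$$

Note: Here, $\phi(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2}$ and $\Phi(z) = \int_{-\infty}^{z}\phi(t)\,dt$ are the standard normal pdf and cdf, respectively.

Next,
$$J = -\int_{\mu^2/(2c^2)}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-w}\,dw \quad \text{after substituting } w = \tfrac{1}{2}z^2$$
$$= -\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\mu^2}{2c^2}\right) = -\phi\left(\frac{\mu}{c}\right).$$

Hence $I = \mu\Phi\left(-\frac{\mu}{c}\right) - c\,\phi\left(\frac{\mu}{c}\right)$,

and so $m = \mu - 2I = \mu - 2\mu\Phi\left(-\frac{\mu}{c}\right) + 2c\,\phi\left(\frac{\mu}{c}\right)$.

Therefore
$$R(\mu) = \frac{\sigma^2}{n} + 2\mu^2 - 2\mu m = \frac{\sigma^2}{n} + 2\mu^2 - 2\mu\left[\mu - 2\mu\Phi\left(-\frac{\mu}{c}\right) + 2c\,\phi\left(\frac{\mu}{c}\right)\right].$$

Thereby we obtain:
$$R(\mu) = \frac{\sigma^2}{n} + 4\mu\left[\mu\Phi\left(-\frac{\mu}{\sigma/\sqrt{n}}\right) - \frac{\sigma}{\sqrt{n}}\phi\left(\frac{\mu}{\sigma/\sqrt{n}}\right)\right].$$

The Bayes risk is then given by
$$r = ER(\mu) = \int R(\mu) f(\mu)\,d\mu = \int g(\mu)\,d\mu,$$
where
$$g(\mu) = \left\{\frac{\sigma^2}{n} + 4\mu\left[\mu\Phi\left(-\frac{\mu}{\sigma/\sqrt{n}}\right) - \frac{\sigma}{\sqrt{n}}\phi\left(\frac{\mu}{\sigma/\sqrt{n}}\right)\right]\right\} \times \frac{1}{\sigma_0}\phi\left(\frac{\mu-\mu_0}{\sigma_0}\right).$$

We see that the Bayes risk $r$ is an intractable integral equal to the area under the integrand, $g(\mu) = R(\mu)f(\mu)$. However, this area can be evaluated numerically (using techniques discussed later). Figures 2.11 and 2.12 show examples of the risk function $R(\mu)$ and the integrand function $g(\mu)$. For the case $n = \sigma = \mu_0 = \sigma_0 = 1$, we find that $r = 1.16$.

Figure 2.11 Some risk functions in Exercise 2.8
(Axes: R(mu) against mu.)

Figure 2.12 Some integrand functions used to calculate the Bayes risk
(Axes: g(mu) against mu. Legend: mu0=0, sig0=1.0 => r=3.000; mu0=1, sig0=1.0 => r=1.160; mu0=5, sig0=1.0 => r=0.999; mu0=0, sig0=0.5 => r=1.500. In each case, n=1 and sig=1.)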
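The closed-form risk function for $\hat\mu = |\bar y|$, namely $R(\mu) = \sigma^2/n + 4\mu[\mu\Phi(-\mu/c) - c\,\phi(\mu/c)]$ with $c = \sigma/\sqrt{n}$, can be checked against direct numerical integration of $(|\bar y|-\mu)^2$ over the sampling distribution of $\bar y$. This is a sketch with hypothetical function names and assumed test values of $\mu$; the integral is split at zero because the integrand has a kink there.

```r
# Closed-form risk of the estimator |ybar| under quadratic error loss
risk_closed <- function(mu, sig, n) {
  cc <- sig / sqrt(n)
  sig^2 / n + 4 * mu * (mu * pnorm(-mu / cc) - cc * dnorm(mu / cc))
}

# The same risk by numerical integration over ybar ~ N(mu, sig^2/n)
risk_numeric <- function(mu, sig, n) {
  cc <- sig / sqrt(n)
  integrand <- function(ybar) (abs(ybar) - mu)^2 * dnorm(ybar, mu, cc)
  integrate(integrand, -Inf, 0)$value + integrate(integrand, 0, Inf)$value
}

mu <- c(-1, 0, 1, 2)                                 # assumed test values
closed <- risk_closed(mu, sig = 1, n = 1)
numer  <- sapply(mu, risk_numeric, sig = 1, n = 1)
print(cbind(mu, closed, numer))
```

For positive $\mu$ the risk lies below $\sigma^2/n$ (taking the absolute value can only move the estimate towards a positive $\mu$), while for negative $\mu$ it is much larger.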
R Code for Exercise 2.8


Rfun=function(mu,sig,n){ sig^2/n+4*mu*( mu*pnorm(-mu/(sig/sqrt(n))) -
  (sig/sqrt(n))*dnorm(mu/(sig/sqrt(n))) ) }
muvec=seq(-10,10,0.01); options(digits=4)

X11(w=8,h=5.5); par(mfrow=c(1,1));
plot(c(-0.5,4),c(0,3),type="n",xlab="mu",ylab="R(mu)",main=" ")

n=1; sig=1; lines(muvec,Rfun(muvec,sig=sig,n=n),lty=1,lwd=3);
abline(v=0,lty=3); abline(h=c(0,sig^2/n),lty=3)
n=5; sig=2; lines(muvec,Rfun(muvec,sig=sig,n=n),lty=2,lwd=3);
abline(h=sig^2/n,lty=3)
n=5; sig=3; lines(muvec,Rfun(muvec,sig=sig,n=n),lty=3,lwd=3);
abline(h=sig^2/n,lty=3)
legend(0.2,3.05,c("sig=1, n=1","sig=2, n=5","sig=3, n=5"),
  lty=c(1,2,3),lwd=c(2,2,2))

Ifun = function(mu,sig,n,mu0,sig0){
  Rfun(mu=mu,sig=sig,n=n)*dnorm(mu,mu0,sig0) }

plot(c(-5,10),c(0,1.5),type="n",xlab="mu",ylab="g(mu) = R(mu)*f(mu)",
  main=" ")

n=1; sig=1; mu0=0; sig0=1
lines(muvec,Ifun(mu=muvec,sig=sig,n=n,mu0=mu0,sig0=sig0),lty=1,lwd=3)
# Check range over which to integrate the integrand
integrate(f=Ifun,lower=-7,upper=7,sig=sig,n=n,mu0=mu0,sig0=sig0)$value
# 3

n=1; sig=1; mu0=1; sig0=1
lines(muvec,Ifun(mu=muvec,sig=sig,n=n,mu0=mu0,sig0=sig0),lty=2,lwd=3)
# Check range over which to integrate the integrand
integrate(f=Ifun,lower=-7,upper=7,sig=sig,n=n,mu0=mu0,sig0=sig0)$value
# 1.16

n=1; sig=1; mu0=5; sig0=1
lines(muvec,Ifun(mu=muvec,sig=sig,n=n,mu0=mu0,sig0=sig0),lty=3,lwd=3)
# Check range over which to integrate the integrand
integrate(f=Ifun,lower=0,upper=10,sig=sig,n=n,mu0=mu0,sig0=sig0)$value
# 0.9994

n=1; sig=1; mu0=0; sig0=0.5
lines(muvec,Ifun(mu=muvec,sig=sig,n=n,mu0=mu0,sig0=sig0),lty=4,lwd=3)
# Check range over which to integrate the integrand
integrate(f=Ifun,lower=-5,upper=5,sig=sig,n=n,mu0=mu0,sig0=sig0)$value
# 1.5

legend(1,1.5,c("mu0=0, sig0=1.0 => r=3.000","mu0=1, sig0=1.0 => r=1.160",
  "mu0=5, sig0=1.0 => r=0.999","mu0=0, sig0=0.5 => r=1.500"),
  lty=c(1,2,3,4),lwd=c(3,3,3,3)); text(5,0.6,"In each case, n=1 and sig=1")


2.6 The posterior expected loss 


We have defined the risk function as the expectation of the loss function given the parameter, namely

$$R(\theta) = E\{L(\hat\theta,\theta) \mid \theta\} = \int L(\hat\theta(y),\theta) f(y \mid \theta)\,dy.$$

Conversely, we now define the posterior expected loss (PEL) as the expectation of the loss function given the data, and we denote this function by

$$PEL(y) = E\{L(\hat\theta,\theta) \mid y\} = \int L(\hat\theta(y),\theta) f(\theta \mid y)\,d\theta.$$

Then, just as the risk function can be used to compute the Bayes risk according to

$$r = EL(\hat\theta,\theta) = EE\{L(\hat\theta,\theta) \mid \theta\} = ER(\theta) = \int R(\theta) f(\theta)\,d\theta,$$
so also can the PEL be used, but with the formula
$$r = EL(\hat\theta,\theta) = EE\{L(\hat\theta,\theta) \mid y\} = E\{PEL(y)\} = \int PEL(y) f(y)\,dy.$$

Note: Both of these formulae for the Bayes risk use the law of iterated expectation, but with different conditionings.
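Since both formulae are just ways of evaluating the joint expectation $EL(\hat\theta,\theta)$, the Bayes risk can also be approximated by simulating parameter and data pairs from the model and averaging the loss. The sketch below does this for the estimator $|\bar y|$ in the normal-normal model with $n = \sigma = \mu_0 = \sigma_0 = 1$; the seed, the simulation size and the variable names are arbitrary assumptions, and the Monte Carlo approach is an alternative to, not a replacement for, the numerical integration used in the exercises.

```r
# Monte Carlo approximation of the Bayes risk r = E L(estimate, mu)
set.seed(1)                               # arbitrary seed, for reproducibility
N <- 200000                               # assumed simulation size
mu0 <- 1; sig0 <- 1; sig <- 1; n <- 1

mu   <- rnorm(N, mu0, sig0)               # draw the parameter from its prior
ybar <- rnorm(N, mu, sig / sqrt(n))       # draw the sample mean given the parameter
loss <- (abs(ybar) - mu)^2                # quadratic error loss of the estimator |ybar|

r_hat <- mean(loss)
se <- sd(loss) / sqrt(N)                  # Monte Carlo standard error
print(c(estimate = r_hat, std_error = se))
```

The estimate should be close to the value $r = 1.16$ reported for this case, up to Monte Carlo error.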


Exercise 2.9 Examples of the PEL and Bayes risk 


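Consider the normal-normal model:
$$(y_1,\ldots,y_n \mid \mu) \sim \text{iid } N(\mu,\sigma^2)$$
$$\mu \sim N(\mu_0,\sigma_0^2).$$

For each of the following estimators, derive a formula for the posterior expected loss under the quadratic error loss function:

(a) $\hat\mu = \bar y = \frac{1}{n}(y_1 + \ldots + y_n)$ (the sample mean)

(b) $\hat\mu = |\bar y|$ (the absolute value of the sample mean).

In each case, use the derived PEL to obtain the Bayes risk.

Note: This exercise is an extension of Exercise 2.8.

Solution to Exercise 2.9

(a) If $\hat\mu = \bar y$ then the PEL function is
$$PEL(y) = E\{L(\hat\mu,\mu) \mid y\} = E\{(\bar y-\mu)^2 \mid y\} = \bar y^2 - 2\bar y E(\mu \mid y) + E(\mu^2 \mid y),$$
where:
$$E(\mu \mid y) = \mu_*$$
$$E(\mu^2 \mid y) = V(\mu \mid y) + \{E(\mu \mid y)\}^2 = \sigma_*^2 + \mu_*^2$$
$$\mu_* = (1-k)\mu_0 + k\bar y,\quad \sigma_*^2 = k\frac{\sigma^2}{n},\quad k = \frac{n}{n + \sigma^2/\sigma_0^2}.$$

Thus, more explicitly,
$$PEL(y) = \bar y^2 - 2\bar y\{(1-k)\mu_0 + k\bar y\} + \sigma_*^2 + \{(1-k)\mu_0 + k\bar y\}^2$$
$$= \bar y^2 - 2(1-k)\bar y\mu_0 - 2k\bar y^2 + \sigma_*^2 + (1-k)^2\mu_0^2 + 2(1-k)k\bar y\mu_0 + k^2\bar y^2$$
$$= \bar y^2(1-k)^2 - \bar y(1-k)^2 2\mu_0 + \sigma_*^2 + (1-k)^2\mu_0^2$$
$$= \sigma_*^2 + (1-k)^2(\bar y-\mu_0)^2.$$

Note: This is a quadratic in $\bar y$ with a minimum of $\sigma_*^2$ at $\bar y = \mu_0$.

The Bayes risk is then
$$r = E\{PEL(y)\} = \sigma_*^2 + (1-k)^2 E\{(\bar y-\mu_0)^2\},$$
where
$$E\{(\bar y-\mu_0)^2\} = V\bar y = EV(\bar y \mid \mu) + VE(\bar y \mid \mu) = E\left(\frac{\sigma^2}{n}\right) + V\mu = \frac{\sigma^2}{n} + \sigma_0^2.$$

Thus
$$r = \sigma_*^2 + (1-k)^2\left(\frac{\sigma^2}{n} + \sigma_0^2\right) = k\frac{\sigma^2}{n} + (1-k)^2\left(\frac{\sigma^2}{n} + \sigma_0^2\right) \quad \left(\text{where } k = \frac{n}{n + \sigma^2/\sigma_0^2}\right)$$
$$= \frac{\sigma^2}{n} \quad \text{(after a little algebra)}.$$

Note: This is in agreement with Exercise 2.8, where the result was obtained much more easily by taking the mean of the risk function, as follows:

$$r = ER(\mu) = E(\sigma^2/n) = \sigma^2/n.$$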
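The 'little algebra' behind $r = \sigma^2/n$ can be confirmed numerically over a range of parameter values. The sketch below evaluates $k\sigma^2/n + (1-k)^2(\sigma^2/n + \sigma_0^2)$ on a small grid; the function name `bayes_risk_ybar` and the grid values are assumptions chosen for illustration.

```r
# Bayes risk of the sample mean computed via the PEL route
bayes_risk_ybar <- function(n, sig, sig0) {
  k <- n / (n + sig^2 / sig0^2)
  sigstar2 <- k * sig^2 / n                          # posterior variance
  sigstar2 + (1 - k)^2 * (sig^2 / n + sig0^2)        # E{PEL(y)}
}

grid <- expand.grid(n = c(1, 5, 20), sig = c(0.5, 1, 3), sig0 = c(0.2, 1, 10))
grid$r      <- with(grid, bayes_risk_ybar(n, sig, sig0))
grid$target <- with(grid, sig^2 / n)                 # the claimed value sigma^2/n
max_diff <- max(abs(grid$r - grid$target))
print(max_diff)
```

The maximum discrepancy is at the level of floating-point rounding, whatever the value of $\sigma_0$, which reflects the fact that the risk of $\bar y$ does not depend on the prior.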


(b) If $\hat\mu = |\bar y|$ then the posterior expected loss function is
$$PEL(y) = E\{(|\bar y|-\mu)^2 \mid y\} = \bar y^2 - 2|\bar y|E(\mu \mid y) + E(\mu^2 \mid y)$$
$$= \bar y^2 - 2|\bar y|\mu_* + \sigma_*^2 + \mu_*^2$$
$$= \bar y^2 - 2|\bar y|\{(1-k)\mu_0 + k\bar y\} + \sigma_*^2 + \{(1-k)\mu_0 + k\bar y\}^2.$$

Some examples of this PEL function are shown in Figure 2.13. In all these examples, $n = \sigma = 1$.

Figure 2.13 Some posterior expected loss functions

In terms of the PEL function, the Bayes risk can be expressed as
$$r = E\{PEL(y)\} = \int PEL(y) f(\bar y)\,d\bar y,$$
where $f(\bar y)$ is the density of the $N(\mu_0,\ \sigma_0^2 + \sigma^2/n)$ distribution, since
$$\bar y \sim N\left(\mu_0,\ \sigma_0^2 + \frac{\sigma^2}{n}\right).$$

As an example, we consider the case $n = \sigma = \mu_0 = \sigma_0 = 1$. Figure 2.14 shows the integrand function $PEL(y)f(\bar y)$. The area under this function works out as 1.16, in agreement with an alternative working for the Bayes risk in Exercise 2.8 (taking an expectation of the risk function).

Figure 2.14 An integrand function with area underneath equal to 1.16
R Code for Exercise 2.9


PELfun=function(ybar,sig,n,sig0,mu0){
  k=n/(n+sig^2/sig0^2)
  mustar=(1-k)*mu0+k*ybar
  sigstar2=k*sig^2/n
  ybar^2-2*abs(ybar)*mustar+sigstar2+mustar^2
}

ybarvec=seq(-10,10,0.01); options(digits=4)
X11(w=8,h=5.5); par(mfrow=c(1,1));

plot(c(-4,5),c(0,3),type="n",xlab="ybar",ylab="PEL(ybar)",main=" ")
abline(v=0,lty=3); abline(h=0,lty=3)

n=1; sig=1; mu0=0; sig0=1
lines(ybarvec,PELfun(ybarvec,sig=sig,n=n,sig0=sig0,mu0=mu0),lty=1,lwd=3);

n=1; sig=1; mu0=1; sig0=1
lines(ybarvec,PELfun(ybarvec,sig=sig,n=n,sig0=sig0,mu0=mu0),lty=2,lwd=3);

n=1; sig=1; mu0=-0.5; sig0=1
lines(ybarvec,PELfun(ybarvec,sig=sig,n=n,sig0=sig0,mu0=mu0),lty=3,lwd=3);

n=1; sig=1; mu0=0; sig0=2
lines(ybarvec,PELfun(ybarvec,sig=sig,n=n,sig0=sig0,mu0=mu0),lty=4,lwd=3);

legend(-4,1.5,c("mu0=0, sig0=1","mu0=1, sig0=1","mu0=-0.5, sig0=1",
  "mu0=0, sig0=2"), lty=c(1,2,3,4), lwd=c(3,3,3,3))

# Calculate r when n=1, sig=1, mu0=1, sig0=1 (should get 1.16 as before)

Jfun = function(ybar,sig,n,sig0,mu0){
  PELfun(ybar=ybar,sig=sig,n=n,sig0=sig0,mu0=mu0)*
    dnorm(ybar,mu0,sqrt(sig0^2+sig^2/n))
}

n=1; sig=1; mu0=1; sig0=1

plot(ybarvec,PELfun(ybar=ybarvec,sig=sig,n=n,sig0=sig0,mu0=mu0)*
  dnorm(ybarvec,mu0,sqrt(sig0^2+sig^2/n)),
  type="l",xlab="ybar",ylab="PEL(ybar)*f(ybar)",lwd=3)

integrate(f=Jfun,lower=-10,upper=10,sig=sig,n=n,mu0=mu0,sig0=sig0)$value
# 1.16 Correct (same as in last exercise)


2.7 The Bayes estimate 


The Bayes estimate (or estimator) is defined to be the choice of the
function $\hat{\theta} = \hat{\theta}(y)$ for which the Bayes risk $r = EL(\hat{\theta}, \theta)$ is minimised.
This estimator has the smallest overall expected loss over all estimators
under the specified loss function $L(\hat{\theta}, \theta)$.

In many cases, the procedure for finding a Bayes estimate can be
considerably simplified by considering which estimate minimises the
posterior expected loss function, $PEL(y) = E\{L(\hat{\theta}, \theta) \mid y\}$.

If we can find an estimate $\hat{\theta} = \hat{\theta}(y)$ which minimises $PEL(y)$ for all
possible values of the data $y$, then that estimate must also minimise the
Bayes risk.

This is because the Bayes risk may be written as a weighted average of
the PEL, namely

$$r = EL(\hat{\theta}, \theta) = EE\{L(\hat{\theta}, \theta) \mid y\} = E\{PEL(y)\} = \int PEL(y) f(y)\, dy.$$

Exercise 2.10 Bayes estimate under the QELF 


Find the Bayes estimate under the quadratic error loss function. 


Solution to Exercise 2.10 


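Observe that
$$\begin{aligned}
PEL(y) &= E\{(\hat{\theta} - \theta)^2 \mid y\} = E\{\hat{\theta}^2 - 2\hat{\theta}\theta + \theta^2 \mid y\} \\
&= \hat{\theta}^2 - 2\hat{\theta}E(\theta \mid y) + E(\theta^2 \mid y) \\
&= \{\hat{\theta} - E(\theta \mid y)\}^2 - \{E(\theta \mid y)\}^2 + E(\theta^2 \mid y).
\end{aligned}$$
Note: We have completed the square in $\hat{\theta}$.

We see that the PEL is a quadratic function of $\hat{\theta}$ which is clearly
minimised at the posterior mean, $\hat{\theta} = E(\theta \mid y)$. So the Bayes estimate
under the QELF is that posterior mean.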


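Note 1: This result can also be obtained using Leibniz's rule for
differentiating an integral, which is generally
$$\frac{d}{dx}\int_{a(x)}^{b(x)} G(u,x)\, du = \int_{a}^{b} \frac{\partial G(u,x)}{\partial x}\, du + G(b,x)\frac{db}{dx} - G(a,x)\frac{da}{dx},$$
and which reduces to $\int_a^b \frac{\partial G(u,x)}{\partial x}\, du + 0 - 0$ if $a$ and $b$ are constants.

Thus we may write
$$\begin{aligned}
\frac{\partial}{\partial \hat{\theta}} PEL(y) &= \frac{\partial}{\partial \hat{\theta}} \int (\hat{\theta} - \theta)^2 f(\theta \mid y)\, d\theta \\
&= \int 2(\hat{\theta} - \theta) f(\theta \mid y)\, d\theta + 0 - 0 \\
&= 2\left\{\hat{\theta} - \int \theta f(\theta \mid y)\, d\theta\right\}.
\end{aligned}$$

Setting this to zero yields $\hat{\theta} = \int \theta f(\theta \mid y)\, d\theta = E(\theta \mid y)$.

Note 2: To check that this minimises the PEL (rather than maximises it)
we may further calculate
$$\frac{\partial^2}{\partial \hat{\theta}^2} PEL(y) = 2\frac{\partial}{\partial \hat{\theta}}\left\{\hat{\theta} - \int \theta f(\theta \mid y)\, d\theta\right\} = 2(1 - 0) > 0.$$

Thus the slope of the PEL ($\partial PEL(y)/\partial \hat{\theta}$) is increasing with $\hat{\theta}$,
implying that $PEL(y)$ is indeed minimised at $\hat{\theta} = \hat{\theta}(y) = E(\theta \mid y)$.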
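
As a numerical sanity check on the result of Exercise 2.10, the following R sketch minimises the posterior expected quadratic loss directly and compares the minimiser with the posterior mean. The Gamma(3, 2) posterior and the function name `pel_quad` are assumptions made purely for illustration; they are not part of the exercise.

```r
# Hypothetical posterior for illustration: (theta | y) ~ Gamma(shape = 3, rate = 2)
shape <- 3; rate <- 2

# Posterior expected quadratic loss E{(t - theta)^2 | y}, by numerical integration
pel_quad <- function(t) {
  integrate(function(th) (t - th)^2 * dgamma(th, shape, rate),
            lower = 0, upper = Inf)$value
}

# Minimise the PEL over t and compare with the posterior mean, shape/rate
t_opt <- optimize(pel_quad, interval = c(0, 10))$minimum
post_mean <- shape / rate
c(t_opt = t_opt, post_mean = post_mean)
```

The two values should agree up to the tolerance of `optimize`, whichever posterior is substituted.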


Exercise 2.11 Bayes estimate under the AELF


Find the Bayesian estimate under the absolute error loss function. 


Solution to Exercise 2.11 


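Suppose that the parameter $\theta$ is continuous, and let $t$ denote $\hat{\theta} = \hat{\theta}(y)$.
Then
$$PEL(y) = \int |t - \theta| f(\theta \mid y)\, d\theta = \int_{-\infty}^{t} (t - \theta) f(\theta \mid y)\, d\theta + \int_{t}^{\infty} (\theta - t) f(\theta \mid y)\, d\theta.$$
So, by Leibniz's rule for differentiation of an integral (in Exercise 2.10),
$$\begin{aligned}
\frac{\partial}{\partial t} PEL(y)
&= \left\{ \int_{-\infty}^{t} \frac{\partial (t - \theta)}{\partial t} f(\theta \mid y)\, d\theta + (t - t) f(t \mid y) \frac{dt}{dt} - 0 \right\} \\
&\quad + \left\{ \int_{t}^{\infty} \frac{\partial (\theta - t)}{\partial t} f(\theta \mid y)\, d\theta + 0 - (t - t) f(t \mid y) \frac{dt}{dt} \right\} \\
&= \left\{ \int_{-\infty}^{t} f(\theta \mid y)\, d\theta + 0 - 0 \right\} + \left\{ -\int_{t}^{\infty} f(\theta \mid y)\, d\theta + 0 - 0 \right\} \\
&= P(\theta < t \mid y) - P(\theta > t \mid y).
\end{aligned}$$

Setting this to zero implies $P(\theta < t \mid y) = P(\theta > t \mid y)$, which yields $t$ as
the posterior median. So the Bayes estimate under the AELF is the
posterior median. This argument can easily be adapted to the case where
$\theta$ is discrete. The idea is to approximate $\theta$'s discrete prior distribution
with a continuous distribution and then apply the result already proved.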
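
The posterior median result can be checked numerically in the same spirit. The R sketch below is a minimal illustration under an assumed Gamma(3, 2) posterior (not part of the exercise); `pel_abs` is a hypothetical helper which evaluates the posterior expected absolute loss by splitting the integral at $t$, exactly as in the derivation above.

```r
# Hypothetical posterior for illustration: (theta | y) ~ Gamma(shape = 3, rate = 2)
shape <- 3; rate <- 2

# Posterior expected absolute loss E{|t - theta| | y}, split at t
pel_abs <- function(t) {
  left  <- integrate(function(th) (t - th) * dgamma(th, shape, rate), 0, t)$value
  right <- integrate(function(th) (th - t) * dgamma(th, shape, rate), t, Inf)$value
  left + right
}

# Minimise over t and compare with the posterior median
t_opt <- optimize(pel_abs, interval = c(0.01, 10))$minimum
post_median <- qgamma(0.5, shape, rate)
c(t_opt = t_opt, post_median = post_median)
```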
Exercise 2.12 Bayes estimate under the IELF


Find the Bayes estimate under the indicator error loss function. 


Solution to Exercise 2.12 


Let $t$ denote $\hat{\theta} = \hat{\theta}(y)$ and first suppose that the parameter $\theta$ is discrete.
The indicator error loss function is $L(t, \theta) = I(t \ne \theta) = 1 - I(t = \theta)$.
Therefore
$$\begin{aligned}
PEL(y) &= E\{L(t, \theta) \mid y\} = E\{1 - I(t = \theta) \mid y\} = 1 - E\{I(t = \theta) \mid y\} \\
&= 1 - P(\theta = t \mid y) \\
&= 1 - f(\theta = t \mid y).
\end{aligned}$$

Thus $PEL(y)$ is minimised at the value of $t$ which maximises the
posterior density $f(\theta \mid y)$. So, when $\theta$ is discrete, the Bayes estimate
under the IELF is the posterior mode, $Mode(\theta \mid y)$.

Now suppose that $\theta$ is continuous. In that case, consider the
approximating loss function
$$L_\varepsilon(t, \theta) = 1 - I(t - \varepsilon < \theta < t + \varepsilon),$$
where $\varepsilon > 0$, and observe that
$$\lim_{\varepsilon \to 0} L_\varepsilon(t, \theta) = 1 - I(t = \theta) = L(t, \theta).$$

The posterior expected loss under the loss function $L_\varepsilon(t, \theta)$ is
$$\begin{aligned}
PEL_\varepsilon(y) &= E\{L_\varepsilon(t, \theta) \mid y\} = 1 - E\{I(t - \varepsilon < \theta < t + \varepsilon) \mid y\} \\
&= 1 - P(t - \varepsilon < \theta < t + \varepsilon \mid y).
\end{aligned}$$

The value of $t$ which minimises $PEL_\varepsilon(y)$ is the value which
maximises the area $P(t - \varepsilon < \theta < t + \varepsilon \mid y)$. But in the limit as $\varepsilon \to 0$,
that value is the posterior mode. So, when $\theta$ is continuous, the Bayes
estimate under the IELF is (as before) the posterior mode, $Mode(\theta \mid y)$.

Note: To clarify the above argument, observe that if $\varepsilon$ is small then
$$PEL_\varepsilon(t) \approx 1 - 2\varepsilon f(\theta = t \mid y).$$

This function of $t$ is minimised at approximately $t = Mode(\theta \mid y)$ and at
exactly $t = Mode(\theta \mid y)$ in the limit as $\varepsilon \to 0$. Figure 2.15 illustrates.

Figure 2.15 Illustration for the continuous case in Exercise 2.12
[Figure: the posterior density $f(\theta \mid y)$, showing the largest possible strip under the posterior with width $2\varepsilon$.]
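
The limiting argument for the continuous case can be illustrated numerically. The following R sketch assumes a Gamma(3, 2) posterior and a strip half-width of 0.01 (both chosen only for illustration), finds the centre $t$ of the strip with the largest posterior probability, and compares it with the posterior mode.

```r
# Hypothetical posterior for illustration: (theta | y) ~ Gamma(shape = 3, rate = 2)
shape <- 3; rate <- 2
eps <- 0.01                      # half-width of the strip (assumed small value)

# Posterior probability of the strip (t - eps, t + eps)
strip_prob <- function(t) pgamma(t + eps, shape, rate) - pgamma(t - eps, shape, rate)

# Maximising the strip probability is the same as minimising PEL_eps
t_opt <- optimize(strip_prob, interval = c(0, 10), maximum = TRUE)$maximum
post_mode <- (shape - 1) / rate  # mode of a gamma density with shape > 1
c(t_opt = t_opt, post_mode = post_mode)
```

Reducing `eps` further should move `t_opt` closer still to the mode.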


Exercise 2.13 Bayesian decision theory in the Poisson-gamma 
model 


Consider a random sample $y_1, \ldots, y_n$ from the Poisson distribution with
parameter $\lambda$ whose prior density is gamma with parameters $\alpha$ and $\beta$.

(a) Find the risk function, Bayes risk and posterior expected loss implied
by the estimator $\hat{\lambda} = 2\bar{y}$ under the quadratic error loss function.

(b) Assuming quadratic error loss, find an estimator of $\lambda$ with a smaller
Bayes risk than the one in (a).


Solution to Exercise 2.13 


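(a) The risk function is
$$\begin{aligned}
R(\lambda) &= E\{L(\hat{\lambda}, \lambda) \mid \lambda\} \\
&= E\{(2\bar{y} - \lambda)^2 \mid \lambda\} \\
&= E\{4\bar{y}^2 - 4\bar{y}\lambda + \lambda^2 \mid \lambda\} \\
&= 4E(\bar{y}^2 \mid \lambda) - 4\lambda E(\bar{y} \mid \lambda) + \lambda^2 \\
&= 4\left[V(\bar{y} \mid \lambda) + \{E(\bar{y} \mid \lambda)\}^2\right] - 4\lambda E(\bar{y} \mid \lambda) + \lambda^2 \\
&= 4\left(\frac{\lambda}{n} + \lambda^2\right) - 4\lambda^2 + \lambda^2 \\
&= \lambda^2 + 4\lambda/n, \quad \lambda > 0 \quad \text{(an increasing quadratic).}
\end{aligned}$$

So the Bayes risk is
$$\begin{aligned}
r &= ER(\lambda) \\
&= E(\lambda^2 + 4\lambda/n) \\
&= \{V\lambda + (E\lambda)^2\} + 4(E\lambda)/n \\
&= \frac{\alpha}{\beta^2} + \left(\frac{\alpha}{\beta}\right)^2 + \frac{4\alpha}{n\beta}.
\end{aligned}$$

To find the posterior expected loss, we first derive $\lambda$'s posterior density:
$$\begin{aligned}
f(\lambda \mid y) &\propto f(\lambda) f(y \mid \lambda) \\
&= \frac{\beta^\alpha \lambda^{\alpha - 1} e^{-\beta\lambda}}{\Gamma(\alpha)} \times \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{y_i}}{y_i!} \\
&\propto \lambda^{\alpha + n\bar{y} - 1} e^{-\lambda(\beta + n)}.
\end{aligned}$$

We see that
$$(\lambda \mid y) \sim Gam(\alpha + n\bar{y}, \beta + n).$$

It follows that
$$\begin{aligned}
PEL(y) &= E\{L(\hat{\lambda}, \lambda) \mid y\} \\
&= E\{(2\bar{y} - \lambda)^2 \mid y\} \\
&= E\{4\bar{y}^2 - 4\bar{y}\lambda + \lambda^2 \mid y\} \\
&= 4\bar{y}^2 - 4\bar{y}E(\lambda \mid y) + E(\lambda^2 \mid y) \\
&= 4\bar{y}^2 - 4\bar{y}\,\frac{\alpha + n\bar{y}}{\beta + n} + \frac{\alpha + n\bar{y}}{(\beta + n)^2} + \left(\frac{\alpha + n\bar{y}}{\beta + n}\right)^2 \\
&= \frac{\alpha + n\bar{y}}{(\beta + n)^2} + \left(2\bar{y} - \frac{\alpha + n\bar{y}}{\beta + n}\right)^2.
\end{aligned}$$
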
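Note: The Bayes risk could also be computed using an argument which
begins as follows:
$$r = E\{PEL(y)\} = E\left\{4\bar{y}^2 - 4\bar{y}\,\frac{\alpha + n\bar{y}}{\beta + n} + \frac{\alpha + n\bar{y}}{(\beta + n)^2} + \left(\frac{\alpha + n\bar{y}}{\beta + n}\right)^2\right\} = \ldots,$$
where, for example,
$$E\bar{y} = EE(\bar{y} \mid \lambda) = EE(y_1 \mid \lambda) = E\lambda = \alpha/\beta.$$
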
(b) The Bayes estimate under the QELF is the posterior mean,
$$E(\lambda \mid y) = \frac{\alpha + n\bar{y}}{\beta + n}.$$

This estimator has the smallest Bayes risk amongst all possible
estimators, including the one in (a), which is different. So $E(\lambda \mid y)$ must
have a smaller Bayes risk than the estimator in (a).


Discussion 


The last statement could be verified by calculating r according to 


for all $n = 1, 2, 3, \ldots$, and all $\alpha, \beta > 0$.
We leave the required working as an additional exercise. 
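
For readers who wish to check their working numerically, the following R sketch compares the two Bayes risks for one set of inputs. The values $n = 5$, $\alpha = 2$, $\beta = 3$ are hypothetical. The closed form used for the posterior mean, $\alpha/\{\beta(\beta + n)\}$, is not stated in the text; it is obtained here by taking the expectation of the posterior variance $(\alpha + n\bar{y})/(\beta + n)^2$ with $E\bar{y} = \alpha/\beta$, and should be treated as an assumption to be verified.

```r
# Closed-form Bayes risks; risk_bayes is a derived expression, not from the text
risk_a     <- function(n, alpha, beta) alpha/beta^2 + (alpha/beta)^2 + 4*alpha/(n*beta)
risk_bayes <- function(n, alpha, beta) alpha/(beta*(beta + n))

n <- 5; alpha <- 2; beta <- 3        # hypothetical inputs

# Monte Carlo check: draw lambda from the prior, then ybar given lambda
set.seed(1)
J <- 200000
lambda <- rgamma(J, shape = alpha, rate = beta)
ybar   <- rpois(J, n*lambda)/n       # the sum of n Poisson(lambda) values is Poisson(n*lambda)
mc_a     <- mean((2*ybar - lambda)^2)
mc_bayes <- mean(((alpha + n*ybar)/(beta + n) - lambda)^2)

rbind(exact       = c(a = risk_a(n, alpha, beta), bayes = risk_bayes(n, alpha, beta)),
      monte_carlo = c(a = mc_a, bayes = mc_bayes))
```

A single set of inputs does not prove the general statement, but it provides a quick check that the posterior mean has the smaller Bayes risk.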
Exercise 2.14 A non-standard loss function 
Consider the Bayesian model given by:
$$(y \mid \mu) \sim N(\mu, 1)$$
$$\mu \sim N(0, 1).$$

Then suppose that the loss function is
$$L(t, \mu) = \begin{cases} 0 & \text{if } 0 < \mu < t < 2\mu \\ 1 & \text{otherwise.} \end{cases}$$

(a) Find the risk function and Bayes risk for the estimator $\hat{\mu} = y$.
Sketch the risk function.

(b) Find the Bayes estimate and sketch it as a function of the data $y$.
Explicitly calculate the Bayes estimate at $y = -1$, 0 and 1, respectively.
Solution to Exercise 2.14

(a) For convenience we will sometimes denote $\hat{\mu} = y$ by $t$. Then, the
loss function may be written as
$$L(t, \mu) = \begin{cases} 1 - I(\mu < t < 2\mu), & \mu > 0 \\ 1, & \mu < 0. \end{cases}$$

Now, for $\mu < 0$ the risk function is simply
$$R(\mu) = E\{L(y, \mu) \mid \mu\} = 1.$$

For $\mu > 0$, the risk function is
$$\begin{aligned}
R(\mu) &= 1 - P(\mu < y < 2\mu \mid \mu) = 1 - P(0 < y - \mu < \mu \mid \mu) \\
&= 1 - P(0 < Z < \mu) \quad \text{where } Z \sim N(0, 1) \\
&= 1 - (\Phi(\mu) - 1/2) = 1.5 - \Phi(\mu).
\end{aligned}$$

In summary,
$$R(\mu) = \begin{cases} 1, & \mu < 0 \\ 1.5 - \Phi(\mu), & \mu > 0, \end{cases}$$
as shown in Figure 2.16.

Figure 2.16 Risk function in Exercise 2.14



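The associated Bayes risk is
$$r = ER(\mu) = \int_{-\infty}^{0} 1\,\phi(\mu)\, d\mu + \frac{3}{2}\int_{0}^{\infty} \phi(\mu)\, d\mu - \int_{0}^{\infty} \Phi(\mu)\phi(\mu)\, d\mu,$$
where
$$I = \int_{0}^{\infty} \Phi(\mu)\phi(\mu)\, d\mu = \int_{1/2}^{1} w\, dw = 3/8,$$
after putting $w = \Phi(\mu)$ with $dw/d\mu = \phi(\mu)$.

So, for the estimator $\hat{\mu} = y$, the Bayes risk is
$$r = \frac{1}{2} + \frac{3}{2} \times \frac{1}{2} - \frac{3}{8} = \frac{7}{8}.$$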
(b) Here, by the theory of the normal-normal model we have that
$$(\mu \mid y) \sim N(\mu_*, \sigma_*^2),$$
where:
$$\mu_* = (1 - k)\mu_0 + k\bar{y}, \quad \sigma_*^2 = k\sigma^2/n, \quad k = 1/\{1 + \sigma^2/(n\sigma_0^2)\},$$
$$n = 1, \quad \mu_0 = 0, \quad \sigma = \sigma_0 = 1, \quad \bar{y} = y.$$

Thus $k = 1/2$, $\mu_* = y/2$ and $\sigma_*^2 = 1/2$, and so
$$(\mu \mid y) \sim N(y/2, 1/2).$$

The posterior expected loss is
$$PEL(y) = E\{L(t, \mu) \mid y\},$$
where $t$ is a function of $y$ (i.e. $t = t(y)$).

Now
$$L(t, \mu) = 1 - I(0 < \mu < t < 2\mu),$$
and so
$$PEL(y) = E\{1 - I(0 < \mu < t < 2\mu) \mid y\} = 1 - P(0 < \mu < t < 2\mu \mid y).$$

We see that if $t = t(y) < 0$ then
$$PEL(y) = 1.$$

Also, if $t \ge 0$ then
$$\begin{aligned}
PEL(y) &= 1 - E\{I(0 < \mu < t < 2\mu) \mid y\} \\
&= 1 - P(0 < \mu < t < 2\mu \mid y) \\
&= 1 - P(t/2 < \mu < t \mid y) \\
&= 1 - \psi(t),
\end{aligned}$$
where
$$\psi(t) = F(\mu = t \mid y) - F(\mu = t/2 \mid y)$$
is to be maximised.


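Now,
$$\begin{aligned}
\psi'(t) &= f(\mu = t \mid y) - f(\mu = t/2 \mid y) \times \frac{1}{2} \\
&= \frac{1}{\sqrt{2\pi(1/2)}} \exp\left\{-\frac{(t - y/2)^2}{2(1/2)}\right\} - \frac{1}{2} \times \frac{1}{\sqrt{2\pi(1/2)}} \exp\left\{-\frac{(t/2 - y/2)^2}{2(1/2)}\right\}.
\end{aligned}$$

Setting $\psi'(t)$ to zero we obtain
$$2\exp\{-(t - y/2)^2\} = \exp\{-(t/2 - y/2)^2\}$$
$$\Rightarrow \frac{3}{4}t^2 - \frac{y}{2}t - \log 2 = 0$$
$$\Rightarrow t = \frac{\dfrac{y}{2} \pm \sqrt{\dfrac{y^2}{4} + 4 \times \dfrac{3}{4}\log 2}}{2(3/4)}.$$

Taking the positive root (since we require $t \ge 0$), we find that the Bayes estimate of $\mu$ is given by
$$\hat{\mu} = \hat{\mu}(y) = \frac{1}{3}\left(y + \sqrt{y^2 + 12\log 2}\right),$$
as shown in Figure 2.17.
We see that the Bayes estimate is a strictly increasing function of $y$ and
converges to zero as $y$ tends to negative infinity. The required values of
the Bayes estimate are:
$$\hat{\mu}(-1) = \frac{1}{3}\left(-1 + \sqrt{1 + 12\log 2}\right) = 0.6842$$
$$\hat{\mu}(0) = \frac{1}{3}\left(0 + \sqrt{0 + 12\log 2}\right) = 0.9614$$
$$\hat{\mu}(1) = \frac{1}{3}\left(1 + \sqrt{1 + 12\log 2}\right) = 1.3508.$$

Figure 2.17 Bayes estimate in Exercise 2.14
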


R Code for Exercise 2.14 

X11(w=8,h=5.5)

# Risk function R(mu): equal to 1 for mu<0 and 1.5-pnorm(mu) for mu>0
muvec <- seq(0,5,0.01); Rvec <- 1.5-pnorm(muvec);
plot(c(-2,5),c(0,1.1),type="n",xlab="mu",ylab="R(mu)",cex=1.5)
lines(muvec,Rvec,lwd=2); lines(c(-2,0),c(1,1),lwd=2)

# Bayes estimate as a function of the data y
yvec <- seq(-30,10,0.01); muhatvec <- (1/3)*(yvec+sqrt(yvec^2 + 12*log(2)))
plot(yvec,muhatvec,type="l",xlab="y",ylab="Bayes estimate",cex=1.5,lwd=2)
abline(h=0,lty=2)

(1/3)*(c(-1,0,1)+sqrt(c(-1,0,1)^2 + 12*log(2)))
# 0.6841672 0.9613513 1.3508339
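
The closed-form Bayes estimate in Exercise 2.14 can also be checked by maximising the posterior probability $P(t/2 < \mu < t \mid y)$ numerically, using $(\mu \mid y) \sim N(y/2, 1/2)$ from the solution. In the R sketch below, the function names `psi` and `bayes_formula` and the search interval (0, 5) are choices made for illustration only.

```r
# psi(t) = P(t/2 < mu < t | y), where (mu | y) ~ N(y/2, 1/2)
psi <- function(t, y) pnorm(t, y/2, sqrt(1/2)) - pnorm(t/2, y/2, sqrt(1/2))

# Closed-form Bayes estimate from the solution
bayes_formula <- function(y) (y + sqrt(y^2 + 12*log(2)))/3

yvals <- c(-1, 0, 1)
numeric_est <- sapply(yvals, function(y)
  optimize(psi, interval = c(0, 5), y = y, maximum = TRUE)$maximum)

cbind(y = yvals, numeric = numeric_est, formula = bayes_formula(yvals))
```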
3.1 Inference given functions of the data


Sometimes we observe a function of the data rather than the data itself. 
In such cases the function typically degrades the information available 
in some way. An example is censoring, where we observe a value only if 
that value is less than some cut-off point (right censoring) or greater than 
some cut-off value (left censoring). It is also possible to have censoring 
on the left and right simultaneously. Another example is rounding, 
where we only observe values to the nearest multiple of 0.1, 1 or 5, etc. 


Exercise 3.1 Right censoring of exponential observations 


Each light bulb of a certain type has a life which is conditionally 
exponential with mean m=1/c, where c has a prior distribution which 
is standard exponential. We observe n = 5 light bulbs of this type for 6 
units of time, and the lifetimes are: 

2.6, 3.2, *, 1.2, *,
where * indicates a right-censored value which is greater than 6. (Only 
values less than or equal to 6 could be observed.) 


Find the posterior distribution and mean of the average light bulb 
lifetime, m. 


Solution to Exercise 3.1 


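The data here is
$$D = \{y_1 = 2.6,\ y_2 = 3.2,\ y_3 > 6,\ y_4 = 1.2,\ y_5 > 6\},$$
and the probability of censoring is
$$P(y_i > 6 \mid c) = \int_6^\infty c e^{-c y_i}\, dy_i = e^{-6c}.$$

Therefore the posterior density of $c$ is
$$\begin{aligned}
f(c \mid D) &\propto f(c) f(D \mid c) \\
&= f(c) f(y_1 \mid c) f(y_2 \mid c) P(y_3 > 6 \mid c) f(y_4 \mid c) P(y_5 > 6 \mid c) \\
&= e^{-c} (c e^{-c y_1})(c e^{-c y_2})(e^{-6c})(c e^{-c y_4})(e^{-6c}) \\
&= c^3 \exp\{-c(1 + y_1 + y_2 + 6 + y_4 + 6)\} \\
&= c^{4-1} \exp\{-c(1 + 2.6 + 3.2 + 6 + 1.2 + 6)\} \\
&= c^{4-1} \exp(-20c).
\end{aligned}$$

Hence:
$$(c \mid D) \sim G(4, 20)$$
$$(m \mid D) \sim IG(4, 20)$$
$$f(m \mid D) = 20^4 m^{-(4+1)} e^{-20/m} / \Gamma(4), \quad m > 0$$
$$E(m \mid D) = 20/(4 - 1) = 6.667.$$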


It will be observed that this estimate of $m$ is appropriately higher than
the estimate obtained by simply averaging the observed values, namely
(1/3)(2.6 + 3.2 + 1.2) = 2.333.

The estimate 6.667 is also higher than the estimate obtained by simply
replacing the censored values with 6, namely
(1/5)(2.6 + 3.2 + 6 + 1.2 + 6) = 3.8.
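
The posterior in Exercise 3.1 can be reproduced with a few lines of R. The sketch below rebuilds the gamma posterior parameters from the data and evaluates $E(m \mid D) = E(1/c \mid D)$ by numerical integration; the variable names (`y_obs`, `n_cens`, `cutoff`) are illustrative only.

```r
y_obs  <- c(2.6, 3.2, 1.2)     # uncensored lifetimes
n_cens <- 2; cutoff <- 6       # two bulbs were still working at time 6

# The standard exponential prior on c is Gamma(1, 1); each observed failure
# adds 1 to the shape, and all time on test is added to the rate
shape <- 1 + length(y_obs)
rate  <- 1 + sum(y_obs) + n_cens*cutoff

# Posterior mean of m = 1/c by numerical integration against the gamma density
post_mean_m <- integrate(function(cval) (1/cval)*dgamma(cval, shape, rate),
                         lower = 0, upper = Inf)$value
c(shape = shape, rate = rate, post_mean_m = post_mean_m, exact = rate/(shape - 1))
```

The censored bulbs contribute to the rate (total time on test) but not to the shape (number of failures), which is why the estimate of $m$ exceeds both of the naive averages above.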


Exercise 3.2 A uniform-uniform model with rounded data 


Suppose that: 
$$(y \mid \theta) \sim U(0, \theta)$$
$$\theta \sim U(0, 2),$$
where the data is
x = g(y) = the value of y rounded to the nearest integer.

Find the posterior density and mean of $\theta$ if we observe x = 1.

Solution to Exercise 3.2 
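Observe that:
$$x = 0 \text{ if } 0 < y < 1/2$$
$$x = 1 \text{ if } 1/2 < y < 3/2$$
$$x = 2 \text{ if } 3/2 < y < 2.$$

Therefore, considering $y$ and $\theta$ on a number line from 0 to 2 in each
case, we have that:
$$P(x = 0 \mid \theta) = P\left(0 < y < \tfrac{1}{2} \,\Big|\, \theta\right) = \begin{cases} 1 & \text{if } \theta < 1/2 \\ \dfrac{1/2}{\theta} & \text{if } 1/2 < \theta < 2 \end{cases}$$
$$P(x = 1 \mid \theta) = P\left(\tfrac{1}{2} < y < \tfrac{3}{2} \,\Big|\, \theta\right) = \begin{cases} 0 & \text{if } 0 < \theta < 1/2 \\ \dfrac{\theta - 1/2}{\theta} & \text{if } 1/2 < \theta < 3/2 \\ \dfrac{1}{\theta} & \text{if } 3/2 < \theta < 2 \end{cases}$$
$$P(x = 2 \mid \theta) = P\left(\tfrac{3}{2} < y < 2 \,\Big|\, \theta\right) = \begin{cases} 0 & \text{if } 0 < \theta < 3/2 \\ \dfrac{\theta - 3/2}{\theta} & \text{if } 3/2 < \theta < 2. \end{cases}$$

Since we observe $x = 1$, the posterior density of $\theta$ is
$$f(\theta \mid x = 1) \propto f(\theta) f(x = 1 \mid \theta) \propto \begin{cases} 1 \times \dfrac{\theta - 1/2}{\theta}, & \tfrac{1}{2} < \theta < \tfrac{3}{2} \\ 1 \times \dfrac{1}{\theta}, & \tfrac{3}{2} < \theta < 2. \end{cases}$$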


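Now, the area under this function is
$$\begin{aligned}
B &= \int_{1/2}^{3/2} \frac{\theta - 1/2}{\theta}\, d\theta + \int_{3/2}^{2} \frac{1}{\theta}\, d\theta \\
&= \left[\theta - \frac{1}{2}\log\theta\right]_{1/2}^{3/2} + \Big[\log\theta\Big]_{3/2}^{2} \\
&= \left\{\left(\frac{3}{2} - \frac{1}{2}\log\frac{3}{2}\right) - \left(\frac{1}{2} - \frac{1}{2}\log\frac{1}{2}\right)\right\} + \left(\log 2 - \log\frac{3}{2}\right) \\
&= 0.7383759.
\end{aligned}$$

So the required posterior density is
$$f(\theta \mid x = 1) = \begin{cases} \dfrac{\theta - 1/2}{B\theta}, & \tfrac{1}{2} < \theta < \tfrac{3}{2} \\ \dfrac{1}{B\theta}, & \tfrac{3}{2} < \theta < 2, \end{cases}$$
and the associated posterior mean of $\theta$ is
$$\begin{aligned}
E_1 = E(\theta \mid x = 1) &= \int_{1/2}^{3/2} \theta\left(\frac{\theta - 1/2}{B\theta}\right) d\theta + \int_{3/2}^{2} \theta\left(\frac{1}{B\theta}\right) d\theta \\
&= \frac{1}{B} = 1.354 \quad \text{(after some working).}
\end{aligned}$$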


Discussion 


In contrast to $f(\theta \mid x)$, the posterior density of $\theta$ given the original data
$y$ is
$$f(\theta \mid y) = \frac{f(\theta) f(y \mid \theta)}{f(y)} = \frac{(1/2)(1/\theta)}{\int_y^2 (1/2)(1/\theta)\, d\theta} = \frac{1}{\theta(\log 2 - \log y)}, \quad y < \theta < 2,$$
and the corresponding posterior mean is
$$E(\theta \mid y) = \int_y^2 \theta\left\{\frac{1}{\theta(\log 2 - \log y)}\right\} d\theta = \frac{2 - y}{\log 2 - \log y}.$$

Figure 3.1 shows $f(\theta \mid x = 1)$ and examples of $f(\theta \mid y)$ which are
consistent with $x = 1$.

Figure 3.1 Posteriors given x = 1 and given y = 0.6, 1, 1.1, 1.4
[Figure: density against theta on (0, 2), with curves f(theta|x=1), f(theta|y=0.6), f(theta|y=1), f(theta|y=1.1) and f(theta|y=1.4).]

It is now of interest to also calculate $f(\theta \mid x)$ for the other two possible
values of $x$, namely 0 and 2. We find that:
$$f(\theta \mid x = 0) = \begin{cases} \dfrac{1}{A}, & 0 < \theta < \tfrac{1}{2} \\ \dfrac{1}{2A\theta}, & \tfrac{1}{2} < \theta < 2, \end{cases}$$
where $A = \frac{1}{2} + \frac{1}{2}\log 2 - \frac{1}{2}\log\frac{1}{2} = 1.1931$, and
$$f(\theta \mid x = 2) = \frac{1}{C}\left(1 - \frac{3}{2\theta}\right), \quad \tfrac{3}{2} < \theta < 2,$$
where $C = 2 - \frac{3}{2}\log 2 - \frac{3}{2} + \frac{3}{2}\log\frac{3}{2} = 0.068477$.

Figure 3.2 shows these two posteriors, and further examples of $f(\theta \mid y)$.

Figure 3.2 Posteriors given x = 0, 1, 2, and given y = 0.1, ..., 1.9
[Figure: density against theta on (0, 2), with curves f(theta|x=0), f(theta|x=1), f(theta|x=2) and a family of curves f(theta|y).]

For completeness and checking we now also calculate the other two
posterior means:
$$E_0 = E(\theta \mid x = 0) = \frac{7}{8A} = 0.7334$$
$$E_2 = E(\theta \mid x = 2) = \frac{1}{8C} = 1.8254,$$
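as well as the unconditional probabilities of the data:
$$\begin{aligned}
P_0 = P(x = 0) &= P\left(y < \tfrac{1}{2}\right) = EP\left(y < \tfrac{1}{2} \,\Big|\, \theta\right) = \int P\left(y < \tfrac{1}{2} \,\Big|\, \theta\right) f(\theta)\, d\theta \\
&= \int_0^{1/2} 1 \times \frac{1}{2}\, d\theta + \int_{1/2}^{2} \frac{1}{2\theta} \times \frac{1}{2}\, d\theta = \frac{1}{4}\left(1 + \log 2 - \log\frac{1}{2}\right) = 0.5966
\end{aligned}$$
$$P_1 = P(x = 1) = 0.3692$$
$$P_2 = P(x = 2) = 0.0342.$$

As a check on our calculations, we note that
$$P_0 + P_1 + P_2 = 1 \quad \text{(which is correct).}$$

We may also calculate the prior mean of $\theta$ (which is obviously 1) as
$$\begin{aligned}
E\theta &= EE(\theta \mid x) \\
&= E(\theta \mid x = 0)P(x = 0) + E(\theta \mid x = 1)P(x = 1) + E(\theta \mid x = 2)P(x = 2) \\
&= E_0 P_0 + E_1 P_1 + E_2 P_2 \\
&= 0.7334 \times 0.5966 + 1.354 \times 0.3692 + 1.825 \times 0.03424 \\
&= 1.000 \quad \text{(correct).}
\end{aligned}$$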


R Code for Exercise 3.2 


X11(w=8,h=5.5); par(mfrow=c(1,1)); options(digits=7)

B=1.5-0.5*log(3/2)-0.5+0.5*log(0.5)+log(2)-log(1.5); c(B,1/B)
# 0.7383759 1.3543237

# Posterior density of theta given x=1
postfunB= function(theta,B=0.7383759){ res=0;
  if((theta>=1/2)&&(theta<3/2)) res=1-1/(2*theta)
  if((theta>=3/2)&&(theta<=2)) res=1/theta
  res/B }

thetavec = seq(0,2,0.001); postvecB=thetavec;
for(i in 1:length(thetavec)) postvecB[i]=postfunB(theta=thetavec[i])
plot(c(0,2),c(0,2),type="n",xlab="theta",ylab="density",main=" ")
lines(thetavec,postvecB,lwd=3)

# Posterior densities of theta given y, for several values of y
y=0.6; k=1/(log(2)-log(y))
lines(thetavec[thetavec>y],k/thetavec[thetavec>y],lty=2,lwd=3)
y=1; k=1/(log(2)-log(y))
lines(thetavec[thetavec>y],k/thetavec[thetavec>y],lty=3,lwd=3)
y=1.1; k=1/(log(2)-log(y))
lines(thetavec[thetavec>y],k/thetavec[thetavec>y],lty=4,lwd=3)
y=1.4; k=1/(log(2)-log(y))
lines(thetavec[thetavec>y],k/thetavec[thetavec>y],lty=5,lwd=3)

legend(0,2,c("f(theta|x=1)","f(theta | y=0.6)","f(theta | y=1)","f(theta | y=1.1)",
  "f(theta | y=1.4)"),lty=c(1,2,3,4,5),lwd=c(3,3,3,3,3))

C=2-1.5*log(2)-1.5+1.5*log(1.5)
A=0.5+0.5*log(2)-0.5*log(0.5)
options(digits=7); c(A,B,C) # 1.19314718 0.73837593 0.06847689
E0=7/(8*A); E1=1/B; E2=1/(8*C); c(E0,E1,E2)
# 0.7333546 1.3543237 1.8254333

P0=1/4+(1/4)*(log(2)-log(1/2))
P1=0.5*(1.5-0.5*log(1.5)-0.5+0.5*log(0.5))+0.5*(log(2)-log(1.5))
P2=0.5*(2-1.5*log(2)-1.5+1.5*log(1.5))

P0+P1+P2 # 1 Correct
c(P0,P1,P2) # 0.59657359 0.36918796 0.03423845
E0*P0 + E1*P1 + E2*P2 # 1 Correct

# Posterior densities of theta given x=0 and given x=2
postfunA= function(theta,A=1.19314718){ res=0;
  if((theta>=0)&&(theta<1/2)) res=1
  if((theta>=1/2)&&(theta<=2)) res=1/(2*theta)
  res/A }

postfunC= function(theta,C=0.06847689){ res=0;
  if((theta>=3/2)&&(theta<2)) res=1-3/(2*theta)
  res/C }

postvecA=thetavec; postvecC=thetavec;
for(i in 1:length(thetavec)){ postvecA[i]=postfunA(theta=thetavec[i])
  postvecC[i]=postfunC(theta=thetavec[i]) }
plot(c(0,2),c(0,3.7),type="n",xlab="theta",ylab="density",main=" ")
lines(thetavec,postvecA,lty=2,lwd=3)
lines(thetavec,postvecB,lty=1,lwd=3)
lines(thetavec,postvecC,lty=3,lwd=3)
for(y in seq(0.1,1.9,0.1)){ k=1/(log(2)-log(y))
  lines(thetavec[thetavec>y],k/thetavec[thetavec>y],lty=1,lwd=1) }

legend(0.7,3.6,c("f(theta | x=0)","f(theta | x=1)","f(theta | x=2)","f(theta | y)"),
  lty=c(2,1,3,1),lwd=c(3,3,3,1))

3.2 Bayesian predictive inference


In addition to estimating model parameters (and functions of those 
parameters) there is often interest in predicting some future data (or 
some other quantity which is not just a function of the model 
parameters). 


Consider a Bayesian model specified by $f(y \mid \theta)$ and $f(\theta)$, with
posterior as derived in ways already discussed and given by $f(\theta \mid y)$.

Now consider any other quantity $x$ whose distribution is defined by a
density of the form $f(x \mid y, \theta)$.

The posterior predictive distribution of $x$ is given by the posterior
predictive density $f(x \mid y)$. This can typically be derived using the
following equation:
$$f(x \mid y) = \int f(x, \theta \mid y)\, d\theta = \int f(x \mid y, \theta) f(\theta \mid y)\, d\theta.$$

Note: For the case where $\theta$ is discrete, a summation needs to be
performed rather than an integral.


The posterior predictive density f(x|y) forms a basis for making 
probability statements about the quantity x given the observed data y. 


Point and interval estimation for future values x can be performed in 
very much the same way as that for model parameters, except with a 
slightly different terminology. 


Now, instead of referring to $\hat{x} = E(x \mid y)$ as the posterior mean of $x$, we
may instead use the term predictive mean.

Also, the ‘P’ in HPDR and CPDR may be read as predictive rather than
as posterior. For example, the CPDR for $x$ is now the central predictive
density region for $x$.

As an example of point prediction, the predictive mean of $x$ is
$$\hat{x} = E(x \mid y) = \int x f(x \mid y)\, dx.$$

Often it is easier to obtain the predictive mean of $x$ using the equation
$$\hat{x} = E(x \mid y) = E\{E(x \mid y, \theta) \mid y\} = \int E(x \mid y, \theta) f(\theta \mid y)\, d\theta.$$

Note: The basic law of iterated expectation (LIE) implies that
$E(x) = EE(x \mid \theta)$. This equation must also be true after conditioning
throughout on $y$. We thereby obtain $E(x \mid y) = E\{E(x \mid y, \theta) \mid y\}$.

Likewise, the predictive variance of $x$ can be calculated via the equation
$$V(x \mid y) = E\{V(x \mid y, \theta) \mid y\} + V\{E(x \mid y, \theta) \mid y\}.$$

Note: This follows from the basic law of iterated variance (LIV),
$Vx = EV(x \mid \theta) + VE(x \mid \theta)$, after conditioning throughout on $y$.


An important special case of Bayesian predictive inference is where the 
quantity of interest x is an independent future replicate of y. 


This means that $(x \mid y, \theta)$ has exactly the same distribution as $(y \mid \theta)$,
which in turn may be expressed mathematically as
$$(x \mid y, \theta) \sim (y \mid \theta)$$
or equivalently as
$$f(x \mid y, \theta) = f(y = x \mid \theta) = f(y \mid \theta)\Big|_{y = x}.$$

Note: The last equation indicates that the pdf of $(x \mid y, \theta)$ is the same as
the pdf of $(y \mid \theta)$ but with $y$ changed to $x$ in the density formula.

In the case where $x$ is an independent future replicate of $y$, we may write
$f(x \mid y, \theta)$ as $f(x \mid \theta)$, and this then implies that
$$f(x \mid y) = \int f(x \mid \theta) f(\theta \mid y)\, d\theta.$$

Exercise 3.3 Prediction in the exponential-exponential model 


Suppose that $\theta$ has the standard exponential distribution, and the
conditional distribution of $y$ given $\theta$ is exponential with mean $1/\theta$.

Find the posterior predictive density of $x$, a future independent replicate
of $y$.

Then, for $y = 2.0$, find the predictive mean, mode and median of $x$, and
also the 80% central predictive density region and 80% highest
predictive density region for $x$.


Solution to Exercise 3.3 


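Recall that the Bayesian model given by:
$$f(y \mid \theta) = \theta e^{-\theta y}, \quad y > 0$$
$$f(\theta) = e^{-\theta}, \quad \theta > 0$$
implies the posterior $(\theta \mid y) \sim Gamma(2, y + 1)$.

Now let $x$ be a future independent replicate of the data $y$, so that
$$f(x \mid y, \theta) = f(x \mid \theta) = f(y = x \mid \theta) = \theta e^{-\theta x}, \quad x > 0.$$

Then the posterior predictive density of $x$ is
$$\begin{aligned}
f(x \mid y) &= \int f(x \mid y, \theta) f(\theta \mid y)\, d\theta \\
&= \int_0^\infty \theta e^{-\theta x} \times \frac{(y + 1)^2 \theta^{2-1} e^{-\theta(y + 1)}}{\Gamma(2)}\, d\theta \\
&= \frac{(y + 1)^2\, \Gamma(3)}{\Gamma(2)(x + y + 1)^3} \int_0^\infty \frac{(x + y + 1)^3 \theta^{3-1} e^{-\theta(x + y + 1)}}{\Gamma(3)}\, d\theta \\
&= \frac{2(y + 1)^2}{(x + y + 1)^3}, \quad x > 0.
\end{aligned}$$

Check:
$$\begin{aligned}
\int_0^\infty f(x \mid y)\, dx &= 2(y + 1)^2 \int_{y+1}^{\infty} u^{-3}\, du \quad \text{where } u = x + y + 1 \\
&= 2(y + 1)^2 \left[\frac{u^{-2}}{-2}\right]_{u = y+1}^{\infty} = (y + 1)^2 \left\{0 + \frac{1}{(y + 1)^2}\right\} = 1.
\end{aligned}$$

Next, suppose that $y = 2$. Then
$$f(x \mid y) = 18(x + 3)^{-3}, \quad x > 0.$$

This is a strictly decreasing function, and so the predictive mode is zero.

The predictive mean can be calculated according to the equation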


$$E(x \mid y) = \int_0^\infty x\, 18(x + 3)^{-3}\, dx.$$

An easier way to find the predictive mean is to note that
$$(\theta \mid y) \sim Gamma(2, 3)$$
and then write
$$\begin{aligned}
E(x \mid y) &= E\{E(x \mid y, \theta) \mid y\} = E(\theta^{-1} \mid y) = \int_0^\infty \theta^{-1}\, \frac{3^2 \theta^{2-1} e^{-3\theta}}{\Gamma(2)}\, d\theta \\
&= \frac{3^2\, \Gamma(1)}{3^1\, \Gamma(2)} \int_0^\infty \frac{3^1 \theta^{1-1} e^{-3\theta}}{\Gamma(1)}\, d\theta = 3.
\end{aligned}$$

An even easier way to do the calculation is to recall a previous exercise
where it was shown that the posterior mean of $\psi = 1/\theta$ is given by
$$E(\psi \mid y) = y + 1.$$
Thus, $E(x \mid y) = E\{E(x \mid y, \theta) \mid y\} = E(\psi \mid y) = y + 1 = 3$ when $y = 2$.

One way to find the predictive median of $x$ is to solve $F(x \mid y) = 1/2$ for
$x$, where $F(x \mid y)$ is the predictive cdf of $x$, or equivalently, to calculate
$Q(1/2)$, where $Q(p) = F^{-1}(p \mid y)$ is the predictive quantile function of
$x$.

Now, the predictive cdf of $x$ is
$$\begin{aligned}
F(x \mid y) &= \int_0^x 18(t + 3)^{-3}\, dt = \int_3^{3+x} 18u^{-3}\, du \quad \text{where } u = 3 + t \\
&= 18\left[\frac{u^{-2}}{-2}\right]_{u=3}^{3+x} = 9\left\{\frac{1}{3^2} - \frac{1}{(3 + x)^2}\right\} = 1 - \frac{9}{(3 + x)^2}.
\end{aligned}$$

Setting this to $p$ and solving for $x$ yields the predictive quantile function,
$$Q(p) = F^{-1}(p \mid y) = 3\left(\frac{1}{\sqrt{1 - p}} - 1\right).$$

So the predictive median is $Q(1/2) = 3(\sqrt{2} - 1) = 1.2426$.

The predictive quantile function can now also be used to calculate the
80% CPDR for $x$,
$$(Q(0.1), Q(0.9)) = (0.1623, 6.4868),$$
and the 80% HPDR for $x$,
$$(0, Q(0.8)) = (0, 3.7082).$$

Another way to calculate the predictive median of $x$ is as the solution in
$q$ of
$$1/2 = P(x < q \mid y)$$
after noting that the right hand side of this equation also equals
$$E\{P(x < q \mid y, \theta) \mid y\} = E(1 - e^{-\theta q} \mid y) = 1 - m(-q),$$
where $m(t)$ is the posterior moment generating function (mgf) of $\theta$.

But $(\theta \mid y) \sim Gamma(2, y + 1)$, and so $m(t) = (1 - t/(y + 1))^{-2}$.

So we need to solve $1/2 = (1 - (-q)/(y + 1))^{-2}$ for $q$. The result is
$q = (y + 1)(\sqrt{2} - 1) = 1.2426$ when $y = 2$ (same as before).


R Code for Exercise 3.3 


Qfun=function(p){ 3*(-1+1/sqrt(1-p)) }; Qfun(0.5) # 1.242641
c(Qfun(0.1),Qfun(0.9)) #0.1622777 6.4868330 
c(0,Qfun(0.8)) # 0.000000 3.708204 
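
The predictive quantiles above can also be checked by simulation, using the two-stage structure of the predictive density: draw $\theta$ from its posterior, then draw $x$ given $\theta$. The R sketch below is a minimal illustration; the Monte Carlo sample size `J` and the seed are arbitrary choices. The predictive mean is deliberately not checked this way, because the predictive density $18(x + 3)^{-3}$ has infinite variance and sample means converge slowly.

```r
set.seed(1)
J <- 100000; y <- 2
theta <- rgamma(J, shape = 2, rate = y + 1)   # draws from the posterior of theta
x     <- rexp(J, rate = theta)                # one future value per posterior draw

Qfun <- function(p) 3*(-1 + 1/sqrt(1 - p))    # predictive quantile function for y = 2
probs <- c(0.1, 0.5, 0.8, 0.9)
rbind(monte_carlo = quantile(x, probs), exact = Qfun(probs))
```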


Exercise 3.4 Predicting a bus number (Extension of Exercise 1.6)


You are visiting a small town with buses whose license plates show their 
numbers consecutively from 1 up to however many there are. In your 
mind the number of buses could be anything from 1 to 5, with all 
possibilities equally likely. Whilst touring the town you first happen to 
see Bus 3. 


Assuming that at any point in time you are equally likely to see any of 
the buses in the town, how likely is it that the next bus number you see 
will be at least 4? 

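Also, what is the expected value of the bus number that you will next
see?

Solution to Exercise 3.4

As in Exercise 1.6, let $\theta$ be the number of buses in the town and let $y$ be
the number of the bus you happen to first see. Recall that a suitable
Bayesian model is:
$$f(y \mid \theta) = 1/\theta, \quad y = 1, \ldots, \theta$$
$$f(\theta) = 1/5, \quad \theta = 1, \ldots, 5 \quad \text{(prior)},$$
and that the posterior density of $\theta$ works out as
$$f(\theta \mid y) = \begin{cases} 20/47, & \theta = 3 \\ 15/47, & \theta = 4 \\ 12/47, & \theta = 5. \end{cases}$$

Now let $x$ be the number on the next bus that you happen to see in the
town. Then
$$f(x \mid y, \theta) = \frac{1}{\theta}, \quad x = 1, \ldots, \theta \quad \text{(same distribution as that of } (y \mid \theta)).$$

This may also be written
$$f(x \mid y, \theta) = I(x \le \theta)/\theta, \quad x = 1, 2, 3, \ldots,$$
and so the posterior predictive density of $x$ is
$$f(x \mid y) = \sum_\theta f(x, \theta \mid y) = \sum_\theta f(x \mid y, \theta) f(\theta \mid y) = \sum_{\theta = y}^{5} \frac{I(x \le \theta)}{\theta} f(\theta \mid y).$$

In our case, the observed value of $y$ is 3 and so:
$$f(x = 1 \mid y) = \frac{1}{3} \times \frac{20}{47} + \frac{1}{4} \times \frac{15}{47} + \frac{1}{5} \times \frac{12}{47} = 0.27270$$
$$f(x = 2 \mid y) = \frac{1}{3} \times \frac{20}{47} + \frac{1}{4} \times \frac{15}{47} + \frac{1}{5} \times \frac{12}{47} = 0.27270$$
$$f(x = 3 \mid y) = \frac{1}{3} \times \frac{20}{47} + \frac{1}{4} \times \frac{15}{47} + \frac{1}{5} \times \frac{12}{47} = 0.27270$$
$$f(x = 4 \mid y) = \frac{1}{4} \times \frac{15}{47} + \frac{1}{5} \times \frac{12}{47} = 0.13085$$
$$f(x = 5 \mid y) = \frac{1}{5} \times \frac{12}{47} = 0.05106.$$

Check: $\sum_{x=1}^{5} f(x \mid y) = 0.27270 \times 3 + 0.13085 + 0.05106 = 1$ (correct).

In summary, for $y = 3$, we have that
$$f(x \mid y) = \begin{cases} 0.27270, & x = 1, 2, 3 \\ 0.13085, & x = 4 \\ 0.05106, & x = 5. \end{cases}$$

So the probability that the next bus you see will have a number on it
which is at least 4 equals
$$P(x \ge 4 \mid y) = \sum_{x:\, x \ge 4} f(x \mid y) = f(x = 4 \mid y) + f(x = 5 \mid y) = 0.13085 + 0.05106 = 18.2\%.$$

Also, the expected value of the bus number you will next see is
$$E(x \mid y) = 1(0.27270) + 2(0.27270) + 3(0.27270) + 4(0.13085) + 5(0.05106) = 2.4149.$$

Alternatively,
$$\begin{aligned}
E(x \mid y) &= E\{E(x \mid y, \theta) \mid y\} = E\left\{\frac{1 + \theta}{2} \,\Big|\, y\right\} = \frac{1}{2}\{1 + E(\theta \mid y)\} \\
&= \frac{1}{2}\left\{1 + 3\left(\frac{20}{47}\right) + 4\left(\frac{15}{47}\right) + 5\left(\frac{12}{47}\right)\right\} = \frac{227}{94} = 2.4149.
\end{aligned}$$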


R Code for Exercise 3.4

# Predictive probabilities f(x|y) for x = 1,...,5 when y = 3
fv=rep(NA,5); fv[1] = (1/3)*(20/47)+(1/4)*(15/47)+(1/5)*(12/47)
fv[2] = fv[1]; fv[3] = fv[1]; fv[4] = (1/4)*(15/47)+(1/5)*(12/47)
fv[5] = (1/5)*(12/47); options(digits=5)

fv # 0.272695 0.272695 0.272695 0.130851 0.051064

sum(fv) # 1 (OK)

sum(fv[4:5]) # 0.18191

sum((1:5)*fv) # 2.4149

227/94 # 2.4149


Exercise 3.5 Prediction in the binomial-beta model 


(a) For the Bayesian model given by $(y \mid \theta) \sim Bin(n, \theta)$ and the prior
$\theta \sim Beta(\alpha, \beta)$, find the posterior predictive density of a future data
value $x$, whose distribution is defined by $(x \mid y, \theta) \sim Bin(m, \theta)$.

(b) A bent coin is tossed 20 times and 6 heads come up. Assuming a flat
prior on the probability of heads on a single toss, what is the probability
that exactly one head will come up on the next two tosses of the same
coin? Answer this using results in (a).

(c) A bent coin is tossed 20 times and 6 heads come up. Assume a
Beta(20.3,20.3) prior on the probability of heads.

Find the expected number of times you will have to toss the same coin
again repeatedly until the next head comes up.

(d) A bent coin is tossed 20 times and 6 heads come up. Assume a
Beta(20.3,20.3) prior on the probability of heads.

Now consider tossing the coin repeatedly until the next head, writing
down the number of tosses, and then doing all of this again repeatedly,
again and again.

The result will be a sequence of natural numbers (for example
3, 1, 1, 4, 2, 2, 1, 5, 1, ....), where each number represents a number of
tails in a row within the sequence, plus one.

Next define $\psi$ to be the average of a very long sequence like this (e.g.
one of length 1,000,000). Find the posterior predictive density and mean
of $\psi$ (approximately).

Note: In parts (c) and (d) the parameters of the beta distribution (both
20.3) represent a prior belief that the probability of heads is about 1/2, is
equally likely to be on either side of 1/2, and is 80% likely to be between
0.4 and 0.6. See the R Code below for details.


Solution to Exercise 3.5 


(a) First note that x is not a future independent replicate of the observed 
data y, except in the special case where m = n. 


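Next recall that $(\theta \mid y) \sim Beta(a, b)$, where:
$$a = \alpha + y, \quad b = \beta + n - y.$$

Thus the posterior predictive density of $x$ is
$$\begin{aligned}
f(x \mid y) &= \int f(x, \theta \mid y)\, d\theta \\
&= \int f(x \mid y, \theta) f(\theta \mid y)\, d\theta \\
&= \int_0^1 \binom{m}{x} \theta^x (1 - \theta)^{m - x} \times \frac{\theta^{a-1}(1 - \theta)^{b-1}}{B(a, b)}\, d\theta \\
&= \binom{m}{x} \frac{B(x + a, m - x + b)}{B(a, b)} \int_0^1 \frac{\theta^{x + a - 1}(1 - \theta)^{m - x + b - 1}}{B(x + a, m - x + b)}\, d\theta \\
&= \binom{m}{x} \frac{B(x + \alpha + y,\ m - x + \beta + n - y)}{B(\alpha + y,\ \beta + n - y)}, \quad x = 0, \ldots, m.
\end{aligned}$$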

Note: The distribution of (x| y) here may be called the beta-binomial. 


(b) Here, we consider the situation in (a) with $n = 20$, $y = 6$, $m = 2$,
$\alpha = 1$, $\beta = 1$ and $x = 0$, 1 or 2. So, specifically,
$$\begin{aligned}
f(x \mid y) &= \binom{2}{x} \frac{B(x + 1 + 6,\ 2 - x + 1 + 20 - 6)}{B(1 + 6,\ 1 + 20 - 6)} \\
&= \binom{2}{x} \frac{\Gamma(7 + x)\Gamma(17 - x)/\Gamma(24)}{\Gamma(7)\Gamma(15)/\Gamma(22)} \\
&= \frac{2!}{x!(2 - x)!} \times \frac{(6 + x)!(16 - x)!/23!}{6!\,14!/21!} \\
&= \begin{cases} 0.4743, & x = 0 \\ 0.4150, & x = 1 \\ 0.1107, & x = 2. \end{cases}
\end{aligned}$$

Check: 0.4743 + 0.4150 + 0.1107 = 1 (correct).

So the (posterior predictive) probability that heads will come up on
exactly one of the next two tosses is $f(x = 1 \mid y = 6) = 41.5\%$.


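Note: An alternative way to do the working here is to see that if $y = 6$
then
$$(\theta \mid y) \sim Beta(1 + 6, 1 + 20 - 6) \sim Beta(7, 15),$$
so that:
$$E(\theta \mid y) = \frac{7}{7 + 15} = \frac{7}{22}$$
$$V(\theta \mid y) = \frac{7 \times 15}{(7 + 15)^2 (7 + 15 + 1)} = 0.009432.$$

Also, $(x \mid y, \theta) \sim Bin(2, \theta)$ (if $y = 6$).

It follows that
$$\begin{aligned}
P(x = 1 \mid y) &= E\{P(x = 1 \mid y, \theta) \mid y\} \\
&= E\{2\theta(1 - \theta) \mid y\} \\
&= 2\{E(\theta \mid y) - E(\theta^2 \mid y)\} \\
&= 2\left[E(\theta \mid y) - \{V(\theta \mid y) + (E(\theta \mid y))^2\}\right] \\
&= 2\left[\frac{7}{22} - \left\{0.009432 + \left(\frac{7}{22}\right)^2\right\}\right] = 0.415.
\end{aligned}$$

(c) Let $z$ be the number of tosses until the next head. Then
$$(z \mid y, \theta) \sim Geometric(\theta)$$
with pdf
$$f(z \mid y, \theta) = (1 - \theta)^{z-1}\theta, \quad z = 1, 2, 3, \ldots$$
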


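So the posterior predictive density of $z$ can be obtained via the equation
$$f(z \mid y) = \int f(z, \theta \mid y)\, d\theta = \int f(z \mid y, \theta) f(\theta \mid y)\, d\theta.$$

It will be noted that $(z \mid y)$ has a density with a similar form to that of
$(x \mid y)$ in (a), but with an infinite range ($z = 1, 2, 3, \ldots$). If we were to write
down $f(z \mid y)$, we could then evaluate the expected number of tosses
until the next head according to the equation
$$E(z \mid y) = \sum_{z=1}^{\infty} z f(z \mid y).$$

More easily, the posterior predictive mean of $z$ can be obtained as
$$\begin{aligned}
E(z \mid y) &= E\{E(z \mid y, \theta) \mid y\} = E\left(\frac{1}{\theta} \,\Big|\, y\right) = \int_0^1 \frac{1}{\theta} \times \frac{\theta^{a-1}(1 - \theta)^{b-1}}{B(a, b)}\, d\theta \\
&= \frac{B(a - 1, b)}{B(a, b)} \int_0^1 \frac{\theta^{(a-1)-1}(1 - \theta)^{b-1}}{B(a - 1, b)}\, d\theta \\
&= \frac{\Gamma(a - 1)\Gamma(b)/\Gamma(a - 1 + b)}{\Gamma(a)\Gamma(b)/\Gamma(a + b)} = \frac{a + b - 1}{a - 1} \\
&= \frac{(\alpha + y) + (\beta + n - y) - 1}{(\alpha + y) - 1} = \frac{\alpha + \beta + n - 1}{\alpha + y - 1}.
\end{aligned}$$

For $n = 20$, $y = 6$ and $\alpha = \beta = 20.3$, we find that $E(z \mid y) = 2.356$.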
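(d) Here, $\psi$ represents the average of a very large number of
independent realisations of the random variable $z$ in (c). Therefore
(approximately), $\psi = E(z \mid y, \theta) = 1/\theta$.

It follows that the posterior predictive density of $\psi$ is
$$f(\psi \mid y) = f(\theta \mid y)\left|\frac{d\theta}{d\psi}\right|,$$
where $\theta = \psi^{-1}$ and $d\theta/d\psi = -\psi^{-2}$. Thus
$$f(\psi \mid y) = \frac{(1/\psi)^{a-1}(1 - 1/\psi)^{b-1}}{B(a, b)} \times \frac{1}{\psi^2}, \quad \psi > 1.$$

So the posterior predictive mean of $\psi$ is
$$\begin{aligned}
E(\psi \mid y) &= \int_1^\infty \psi\, \frac{(1/\psi)^{a+1}(1 - 1/\psi)^{b-1}}{B(a, b)}\, d\psi \\
&= \frac{B(a - 1, b)}{B(a, b)} \int_1^\infty \frac{(1/\psi)^{a}(1 - 1/\psi)^{b-1}}{B(a - 1, b)}\, d\psi.
\end{aligned}$$

The last integral is 1, by analogy of its integrand with $f(\psi \mid y)$. Thus we
obtain the same expression as for $E(z \mid y)$ and $E(1/\theta \mid y)$ in (c), namely
$$E(\psi \mid y) = \frac{\alpha + \beta + n - 1}{\alpha + y - 1}.$$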


R Code for Exercise 3.5 


options(digits=4); pbeta(0.4,20.3,20.3) # 0.1004 
pbeta(0.6,20.3,20.3) - pbeta(0.4,20.3,20.3) # 0.7993 


x=0:2
( 2*factorial(6+x)*factorial(16-x)/factorial(23) )/
  ( factorial(x)*factorial(2-x)*factorial(6)*factorial(14)/factorial(21) )
# 0.4743 0.4150 0.1107

7*15/(22^2*23) # 0.009432

2 * (7/22 - ( 0.009432267 + (7/22)^2 ) ) # 0.415
(20.3+20.3+20-1)/(20.3+6-1) # 2.356


Exercise 3.6 Prediction in the normal-normal model (with
variance known)


Consider the Bayesian model given by:
$$(y_1,\ldots,y_n \mid \mu) \sim \text{iid } N(\mu,\sigma^2)$$
$$\mu \sim N(\mu_0,\sigma_0^2),$$
and suppose we have data in the form of the vector $y = (y_1,\ldots,y_n)$.


Also suppose there is interest in m future values:
$$(x_1,\ldots,x_m \mid y,\mu) \sim \text{iid } N(\mu,\sigma^2).$$


Find the posterior predictive distribution of
$$\bar{x} = (x_1 + \ldots + x_m)/m,$$
both generally and in the case of a priori ignorance regarding $\mu$.


Solution to Exercise 3.6 


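By Exercise 1.18 the posterior distribution of $\mu$ is given by
$$(\mu \mid y) \sim N(\mu_*,\sigma_*^2),$$
where:
$$\mu_* = (1-k)\mu_0 + k\bar{y}, \quad \sigma_*^2 = k\frac{\sigma^2}{n}, \quad k = \frac{n}{n + \sigma^2/\sigma_0^2}.$$


Now, $(\bar{x} \mid y,\mu) \sim N(\mu,\sigma^2/m)$, and therefore
$$f(\bar{x} \mid y) = \int f(\bar{x} \mid y,\mu)\,f(\mu \mid y)\,d\mu$$
$$\propto \int \exp\left\{-\frac{(\bar{x}-\mu)^2}{2\sigma^2/m}\right\} \exp\left\{-\frac{(\mu-\mu_*)^2}{2\sigma_*^2}\right\} d\mu.$$


This is the integral of the exponential of a quadratic in both $\bar{x}$ and $\mu$ and
so must equal the exponential of a quadratic in $\bar{x}$. It follows that
$$(\bar{x} \mid y) \sim N(\eta,\delta^2),$$
where $\eta$ and $\delta^2$ are to be determined. This final step is easily achieved
as follows:
$$\eta = E(\bar{x} \mid y) = E\{E(\bar{x} \mid y,\mu) \mid y\} = E\{\mu \mid y\} = \mu_*.$$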
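$$\delta^2 = V(\bar{x} \mid y) = E\{V(\bar{x} \mid y,\mu) \mid y\} + V\{E(\bar{x} \mid y,\mu) \mid y\}$$
$$= E\left\{\frac{\sigma^2}{m} \,\Big|\, y\right\} + V\{\mu \mid y\} = \frac{\sigma^2}{m} + \sigma_*^2.$$


Thus generally we have that
$$(\bar{x} \mid y) \sim N\left((1-k)\mu_0 + k\bar{y},\; k\frac{\sigma^2}{n} + \frac{\sigma^2}{m}\right).$$


A special case is where there is no prior information regarding the
normal mean $\mu$. In this case, assuming it is appropriate to set $\sigma_0 = \infty$
(so that $f(\mu) \propto 1$, $\mu \in \mathbb{R}$), we have that k = 1 and hence
$$(\bar{x} \mid y) \sim N\left(\bar{y},\; \frac{\sigma^2}{n} + \frac{\sigma^2}{m}\right).$$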
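
The special case of this result (k = 1) can be checked by simulation. The sketch below is illustrative only: the values of n, m, sigma and ybar are hypothetical, and the check works by first drawing $\mu$ from its posterior and then drawing $\bar{x}$ given $\mu$.

```r
set.seed(1)

# Hypothetical values chosen purely for illustration
n <- 10; m <- 5; sigma <- 2; ybar <- 1
J <- 100000                      # number of Monte Carlo draws

# Step 1: draw mu from its posterior under the flat prior (k = 1)
mu <- rnorm(J, mean = ybar, sd = sigma / sqrt(n))

# Step 2: draw the future sample mean given mu
xbar <- rnorm(J, mean = mu, sd = sigma / sqrt(m))

# Compare simulated moments with the theoretical N(ybar, sigma^2/n + sigma^2/m)
theo_var <- sigma^2 / n + sigma^2 / m
c(sim_mean = mean(xbar), sim_var = var(xbar), theo_mean = ybar, theo_var = theo_var)
```

The simulated mean and variance should be close to the theoretical values, up to Monte Carlo error.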


Exercise 3.7 Prediction in the normal-gamma model (with a 
known mean) 


Consider the Bayesian model given by:
$$(y_1,\ldots,y_n \mid \lambda) \sim \text{iid } N(\mu,1/\lambda)$$
$$\lambda \sim G(\alpha,\beta),$$
and suppose we have data in the form of the vector $y = (y_1,\ldots,y_n)$.


Also, suppose we are interested in m future values:
$$(x_1,\ldots,x_m \mid y,\lambda) \sim \text{iid } N(\mu,1/\lambda).$$


Find the posterior predictive distribution of
$$\bar{x} = (x_1 + \ldots + x_m)/m,$$
both generally and in the case of a priori ignorance regarding $\lambda$.


Solution to Exercise 3.7 


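By Exercise 1.20 the posterior distribution of $\lambda$ is given by
$$(\lambda \mid y) \sim \text{Gamma}(a,b),$$
where:
$$a = \alpha + \frac{n}{2}, \quad b = \beta + \frac{n}{2}s_\mu^2, \quad s_\mu^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i-\mu)^2.$$


Now, $(\bar{x} \mid y,\lambda) \sim N(\mu,1/(m\lambda))$, and therefore
$$f(\bar{x} \mid y) = \int f(\bar{x} \mid y,\lambda)\,f(\lambda \mid y)\,d\lambda$$
$$\propto \int_0^{\infty} \sqrt{\lambda}\,\exp\left\{-\frac{m\lambda}{2}(\bar{x}-\mu)^2\right\} \lambda^{a-1}\exp(-\lambda b)\,d\lambda$$
$$= \int_0^{\infty} \lambda^{a+\frac{1}{2}-1} \exp\left\{-\lambda\left[b + \frac{m}{2}(\bar{x}-\mu)^2\right]\right\} d\lambda$$
$$\propto \left[b + \frac{m}{2}(\bar{x}-\mu)^2\right]^{-(a+\frac{1}{2})} \propto \left[1 + \frac{m(\bar{x}-\mu)^2}{2b}\right]^{-\frac{1}{2}(2a+1)}.$$


Now let
$$Q = \frac{\bar{x}-\mu}{\sqrt{b/a}/\sqrt{m}}, \quad \text{so that} \quad \bar{x} = \mu + Q\frac{\sqrt{b/a}}{\sqrt{m}} \quad \text{and} \quad \frac{m(\bar{x}-\mu)^2}{2b} = \frac{Q^2}{2a}.$$


Then by the transformation rule,
$$f(Q \mid y) = f(\bar{x} \mid y)\left|\frac{d\bar{x}}{dQ}\right| \propto \left[1 + \frac{Q^2}{2a}\right]^{-\frac{1}{2}(2a+1)} \times \frac{\sqrt{b/a}}{\sqrt{m}} \propto \left[1 + \frac{Q^2}{2a}\right]^{-\frac{1}{2}(2a+1)}.$$


This implies that $(Q \mid y) \sim t(2a)$, or equivalently,
$$\left(\frac{\bar{x}-\mu}{\sqrt{\dfrac{s_\mu^2 + 2\beta/n}{1 + 2\alpha/n}}\Big/\sqrt{m}} \;\Bigg|\; y\right) \sim t(n+2\alpha).$$


A special case of this general result is when there is no prior information
regarding the precision parameter $\lambda$. In that case, and assuming it is then
appropriate to set $\alpha = \beta = 0$ (so that $f(\lambda) \propto 1/\lambda$, $\lambda > 0$), we have that
$$\left(\frac{\bar{x}-\mu}{s_\mu/\sqrt{m}} \;\Big|\; y\right) \sim t(n).$$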


3.3 Posterior predictive p-values


Earlier, in Section 1.3, we discussed Bayes factors as a form of
hypothesis testing within the Bayesian framework. An entirely different
way to perform hypothesis testing in that framework is via the theory of
posterior predictive p-values (Meng, 1994). As in the theory of Bayes
factors, this involves first specifying a null hypothesis
$$H_0: E_0$$
and an alternative hypothesis
$$H_1: E_1,$$
where $E_0$ and $E_1$ are two events.


Note: As in Section 1.3, $E_0$ and $E_1$ may or may not be disjoint. Also,
$E_0$ and $E_1$ may instead represent two different models for the same data.


In the context of a single Bayesian model with data y and parameter $\theta$,
the theory of posterior predictive p-values involves the following steps:
(i) Define a suitable discrepancy measure (or test statistic), denoted
$T(y,\theta)$,
following careful consideration of both $H_0$ and $H_1$ (see below).
(ii) Define x as an independent future replicate of the data y.
(iii) Calculate the posterior predictive p-value (ppp-value), defined as
$$p = P\{T(x,\theta) \ge T(y,\theta) \mid y\}.$$


Note 1: The ppp-value is calculated under the implicit assumption that
$H_0$ is true. Thus we could also write $p = P\{T(x,\theta) \ge T(y,\theta) \mid y, H_0\}$.


Note 2: The discrepancy measure may or may not depend on the model
parameter, $\theta$. Thus in some cases, $T(y,\theta)$ may also be written as $T(y)$.


The underlying idea behind the choice of discrepancy measure T is that
if the observed data y is highly inconsistent with $H_0$ in favour of $H_1$,
then p should likely be small. This is the same idea as behind classical
hypothesis testing. In fact, the classical theory may be viewed as a
special case of the theory of ppp-values. The advantage of the ppp-value
framework is that it is far more versatile and can be used in situations
where it is not obvious how the classical theory should be applied.

An example of how ppp-value theory can perform well relative to the
classical theory is where the null hypothesis is composite, meaning that
it consists of the specification of multiple values rather than a single
value (e.g. $H_0: |\theta| < \varepsilon$ as compared to $H_0: \theta = 0$). The next exercise
illustrates this feature.


Exercise 3.8 Posterior predictive p-values for testing a 
composite null hypothesis 


Consider the Bayesian model given by:
$$(y \mid \lambda) \sim \text{Poisson}(\lambda)$$
$$f(\lambda) = e^{-\lambda}, \quad \lambda > 0,$$
and suppose that we observe y = 3.


(a) Find a suitable ppp-value for testing 
H,:4 =1 versus LI 72. 


(b) Find a suitable ppp-value for testing
$H_0: \lambda \in \{1,2\}$ versus $H_1: \lambda > 2$.


Solution to Exercise 3.8 


(a) Here, $(x \mid y,\lambda) \sim \text{Poi}(\lambda)$, and we may define the test statistic as
$$T(y,\lambda) = y.$$


Then, the posterior predictive p-value is
$$p = P(x \ge y \mid y, \lambda = 1) = 1 - F_{\text{Poi}(1)}(y-1),$$
where y = 3 and where $F_{\text{Poi}(q)}(r)$ is the cumulative distribution function
of a Poisson random variable with mean q, evaluated at r.


Thus a suitable ppp-value is
$$p = 1 - \left(\frac{e^{-1}1^0}{0!} + \frac{e^{-1}1^1}{1!} + \frac{e^{-1}1^2}{2!}\right) = 0.08030.$$


Note: This is just the probability that a Poisson(1) random variable will 
take on a value greater than 2, and so is the same as the classical 
p-value which would be used in this situation. 
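
(b) Here we first observe that
$$f(\lambda \mid y,H_0) \propto f(\lambda \mid H_0)\,f(y \mid H_0,\lambda) = \frac{e^{-\lambda}}{e^{-1}+e^{-2}} \times \frac{e^{-\lambda}\lambda^y}{y!} \propto e^{-2\lambda}\lambda^y, \quad \lambda = 1,2 \text{ (with } y = 3).$$
Thus:
$$P(\lambda = 1 \mid y,H_0) = \frac{e^{-2}1^3}{e^{-2}1^3 + e^{-4}2^3} = 0.48015$$
$$P(\lambda = 2 \mid y,H_0) = 1 - 0.48015 = 0.51985.$$


So a suitable ppp-value is
$$p = P(x \ge y \mid y,H_0) = E\{P(x \ge y \mid y,H_0,\lambda) \mid y,H_0\}$$
$$= E\{1 - F_{\text{Poi}(\lambda)}(y-1) \mid y,H_0\}$$
$$= 0.48015 \times (1 - F_{\text{Poi}(1)}(2)) + 0.51985 \times (1 - F_{\text{Poi}(2)}(2))$$
$$= 0.48015\left\{1 - \left(\frac{e^{-1}1^0}{0!} + \frac{e^{-1}1^1}{1!} + \frac{e^{-1}1^2}{2!}\right)\right\} + 0.51985\left\{1 - \left(\frac{e^{-2}2^0}{0!} + \frac{e^{-2}2^1}{1!} + \frac{e^{-2}2^2}{2!}\right)\right\}$$
$$= 0.20664.$$

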
R Code for Exercise 3.8 


options(digits=5); 1-ppois(2,1) # 0.080301 
p1=exp(-2)/(exp(-2)+8*exp(-4)); c(p1,1-p1) # 0.48015 0.51985 
p1*(1-ppois(2,1))+(1-p1)*(1-ppois(2,2)) # 0.20664 
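
The ppp-value in (b) can also be approximated by direct simulation, which mirrors the definition $p = P\{T(x,\theta) \ge T(y,\theta) \mid y, H_0\}$. The following is a sketch only; the number of draws `J` and the variable names are illustrative choices, not part of the original solution.

```r
set.seed(1)
y <- 3
J <- 100000                          # number of Monte Carlo draws (arbitrary)

# Posterior probabilities of lambda = 1 and lambda = 2 under H0,
# proportional to exp(-2*lambda) * lambda^y
w <- exp(-2 * (1:2)) * (1:2)^y
post <- w / sum(w)

# Draw lambda from its posterior under H0, then a replicate x given lambda
lambda <- sample(1:2, J, replace = TRUE, prob = post)
x <- rpois(J, lambda)

ppp_mc <- mean(x >= y)                                # simulation estimate
ppp_exact <- sum(post * (1 - ppois(y - 1, 1:2)))      # exact value from the formula
c(ppp_mc = ppp_mc, ppp_exact = ppp_exact)
```

The simulation estimate should agree with the exact value up to Monte Carlo error.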


Exercise 3.9 Posterior predictive p-values for testing a normal 
mean 


Consider a random sample $y_1,\ldots,y_n$ from a normal distribution with
variance $\sigma^2$, where the prior on the precision parameter $\lambda = 1/\sigma^2$ is
given by $\lambda \sim \text{Gamma}(0,0)$, or equivalently by $f(\lambda) \propto 1/\lambda$, $\lambda > 0$.


We wish to test the null hypothesis
$H_0$: that the normal mean equals $\mu$


against the alternative hypothesis
$H_1$: that the normal mean is greater than $\mu$


(where $\mu$ is a specified constant of interest).

Derive a formula for the ppp-value under each of the following three
choices of the test statistic:
$$\text{(a) } T(y,\lambda) = \bar{y}, \quad \text{(b) } T(y,\lambda) = \frac{\bar{y}-\mu}{\sigma/\sqrt{n}}, \quad \text{(c) } T(y,\lambda) = \frac{\bar{y}-\mu}{s_y/\sqrt{n}},$$
where:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \quad \text{(the sample mean)}$$
$$s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2 \quad \text{(the sample variance).}$$


For each of these choices of test statistic, report the ppp-value for the
case where $\mu = 2$ and y = (2.1, 4.0, 3.7, 5.5, 3.0, 4.6, 8.3, 2.2, 4.1, 6.2).


Solution to Exercise 3.9 
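(a) Let $\bar{x} = (x_1 + \ldots + x_n)/n$ be the mean of an independent replicate of
the sample values, defined by $(x_1,\ldots,x_n \mid y,\lambda) \sim \text{iid } N(\mu,\sigma^2)$.


Then, by Exercise 3.7,
$$\left(\frac{\bar{x}-\mu}{s_{y\mu}/\sqrt{n}} \;\Big|\; y\right) \sim t(n), \quad \text{where } s_{y\mu}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i-\mu)^2.$$


From this, if the test statistic is $T(y,\lambda) = \bar{y}$, then the ppp-value is
$$p = P(\bar{x} > \bar{y} \mid y) = P\left(\frac{\bar{x}-\mu}{s_{y\mu}/\sqrt{n}} > \frac{\bar{y}-\mu}{s_{y\mu}/\sqrt{n}} \;\Big|\; y\right) = 1 - F_{t(n)}\left(\frac{\bar{y}-\mu}{s_{y\mu}/\sqrt{n}}\right).$$


Numerically, we have that:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = 4.370, \quad s_{y\mu} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\mu)^2} = 2.978.$$


Therefore $\dfrac{\bar{y}-\mu}{s_{y\mu}/\sqrt{n}} = 2.51658$, and so $p = 1 - F_{t(n)}(2.51658) = 0.01528$.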


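(b) If $T(y,\lambda) = \dfrac{\bar{y}-\mu}{\sigma/\sqrt{n}}$ then the ppp-value is
$$p = P\left(\frac{\bar{x}-\mu}{\sigma/\sqrt{n}} > \frac{\bar{y}-\mu}{\sigma/\sqrt{n}} \;\Big|\; y\right) = P(\bar{x} > \bar{y} \mid y).$$
We see that the answer here is exactly the same as in (a).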
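
(c) If $T(y,\lambda) = \dfrac{\bar{y}-\mu}{s_y/\sqrt{n}}$ then the ppp-value is
$$p = P\left(\frac{\bar{x}-\mu}{s_x/\sqrt{n}} > \frac{\bar{y}-\mu}{s_y/\sqrt{n}} \;\Big|\; y\right), \quad \text{where } s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$
$$= E\left\{P\left(\frac{\bar{x}-\mu}{s_x/\sqrt{n}} > \frac{\bar{y}-\mu}{s_y/\sqrt{n}} \;\Big|\; y,\lambda\right) \Big|\; y\right\} \quad \text{by the law of iterated expectation}$$
$$= E\left\{1 - F_{t(n-1)}\left(\frac{\bar{y}-\mu}{s_y/\sqrt{n}}\right) \Big|\; y\right\} \quad \text{since } \left(\frac{\bar{x}-\mu}{s_x/\sqrt{n}} \;\Big|\; y,\lambda\right) \sim t(n-1)$$
$$= 1 - F_{t(n-1)}\left(\frac{\bar{y}-\mu}{s_y/\sqrt{n}}\right).$$


We see that the ppp-value derived is exactly the same as the classical
p-value which would be used in this setting. Numerically, we have that:
$$s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2} = 1.901, \quad \frac{\bar{y}-\mu}{s_y/\sqrt{n}} = 3.942645.$$


Consequently, the ppp-value is $p = 1 - F_{t(n-1)}(3.942645) = 0.001696$.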


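Note: A fourth test statistic which makes sense in the present context is
$$T(y,\lambda) = \frac{\bar{y}-\mu}{s_{y\mu}/\sqrt{n}}, \quad \text{where } s_{y\mu}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i-\mu)^2 \text{ (as before).}$$


This implies a ppp-value given by
$$p = P\left(\frac{\bar{x}-\mu}{s_{x\mu}/\sqrt{n}} > \frac{\bar{y}-\mu}{s_{y\mu}/\sqrt{n}} \;\Big|\; y\right), \quad \text{where } s_{x\mu}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2.$$


This ppp-value is more difficult to calculate, and it cannot be expressed
in terms of well-known quantities, e.g. the cdf of a t distribution, as in
(a), (b) and (c). (Here, $\bar{x}$ and $s_{x\mu}$ are not independent, given y and $\lambda$.)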


For more details, regarding this exercise specifically and ppp-values 
generally, see Meng (1994) and Gelman et al. (2004). 


R Code for Exercise 3.9


options(digits=4); mu=2; y = c(2.1, 4.0, 3.7, 5.5, 3.0, 4.6, 8.3, 2.2, 4.1, 6.2);
n=length(y); ybar=mean(y); s=sd(y); smu=sqrt(mean((y-mu)^2))
c(ybar,s,smu) # 4.370 1.901 2.978
arga=(ybar-mu)/(smu/sqrt(n)); pppa=1-pt(arga,n); c(arga,pppa)
# 2.51658 0.01528
argc=(ybar-mu)/(s/sqrt(n)); pppc=1-pt(argc,n-1); c(argc,pppc)
# 3.942645 0.001696
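
The ppp-value for the fourth test statistic in the Note to Exercise 3.9 has no convenient closed form, but it can be approximated by simulation. The following is a sketch under stated assumptions: under $H_0$ with the prior $f(\lambda) \propto 1/\lambda$, the posterior of $\lambda$ is taken to be Gamma(n/2, n s²/2) with s² the mean squared deviation from $\mu$ (the $\alpha = \beta = 0$ case of Exercise 3.7), and the helper `Tstat` and the number of draws `J` are hypothetical names introduced here. The same draws also give a simulation check of the answer to (a).

```r
set.seed(1)
mu <- 2
y <- c(2.1, 4.0, 3.7, 5.5, 3.0, 4.6, 8.3, 2.2, 4.1, 6.2)
n <- length(y)
smu2 <- mean((y - mu)^2)          # mean squared deviation of y from mu

# Fourth test statistic: (mean - mu) / (s_mu / sqrt(n)), s_mu^2 = mean((v - mu)^2)
Tstat <- function(v) (mean(v) - mu) / (sqrt(mean((v - mu)^2)) / sqrt(n))

J <- 20000                        # number of replicate data sets (arbitrary)

# Draw lambda from its posterior under H0 (assumed Gamma(n/2, n*smu2/2))
lambda <- rgamma(J, shape = n / 2, rate = n * smu2 / 2)

# Row i of X is a replicate sample of size n drawn with precision lambda[i];
# the matrix is filled column by column, so sd is repeated once per column
X <- matrix(rnorm(J * n, mean = mu, sd = rep(1 / sqrt(lambda), times = n)), nrow = J)

ppp_a <- mean(rowMeans(X) > mean(y))            # check of (a); exact value is 0.01528
ppp_4 <- mean(apply(X, 1, Tstat) > Tstat(y))    # fourth statistic, by simulation
c(ppp_a = ppp_a, ppp_4 = ppp_4)
```

Increasing `J` reduces the Monte Carlo error of both estimates.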


3.4 Bayesian models with multiple parameters 


So far we have examined Bayesian models involving some data y and a
parameter $\theta$, where $\theta$ is a strictly scalar quantity. We now consider the
case of Bayesian models with multiple parameters, starting with a focus
on just two, say $\theta_1$ and $\theta_2$. In that case, the Bayesian model may be
defined by specifying $f(y \mid \theta)$ and $f(\theta)$ in the same way as previously,
but with an understanding that $\theta$ is a vector of the form $\theta = (\theta_1,\theta_2)$.


The first task now is to find the joint posterior density of $\theta_1$ and $\theta_2$,
according to
$$f(\theta \mid y) \propto f(\theta)\,f(y \mid \theta),$$
or equivalently
$$f(\theta_1,\theta_2 \mid y) \propto f(\theta_1,\theta_2)\,f(y \mid \theta_1,\theta_2),$$
where
$$f(\theta) = f(\theta_1,\theta_2)$$
is the joint prior density of the two parameters.


Often, this joint prior density is specified as an unconditional prior
multiplied by a conditional prior, for example as
$$f(\theta_1,\theta_2) = f(\theta_1)\,f(\theta_2 \mid \theta_1).$$


Once a Bayesian model with two parameters has been defined, one task
is to find the marginal posterior densities of $\theta_1$ and $\theta_2$, respectively, via
the equations:
$$f(\theta_1 \mid y) = \int f(\theta_1,\theta_2 \mid y)\,d\theta_2$$
$$f(\theta_2 \mid y) = \int f(\theta_1,\theta_2 \mid y)\,d\theta_1.$$

From these two marginal posteriors, one may obtain point and interval
estimates of $\theta_1$ and $\theta_2$ in the usual way (treating each parameter
separately). For example, the marginal posterior mean of $\theta_1$ is
$$\hat{\theta}_1 = E(\theta_1 \mid y) = \int \theta_1\,f(\theta_1 \mid y)\,d\theta_1.$$


Another way to do this calculation is via the law of iterated expectation,
according to
$$\hat{\theta}_1 = E(\theta_1 \mid y) = E\{E(\theta_1 \mid y,\theta_2) \mid y\} = \int E(\theta_1 \mid y,\theta_2)\,f(\theta_2 \mid y)\,d\theta_2.$$


Note: The equation $E(\theta_1 \mid y) = E\{E(\theta_1 \mid y,\theta_2) \mid y\}$ follows from the
simpler identity $E\theta_1 = E\,E(\theta_1 \mid \theta_2)$ after conditioning throughout on y.


Here, $E(\theta_1 \mid y,\theta_2)$ is called the conditional posterior mean of $\theta_1$ and can
be calculated as
$$E(\theta_1 \mid y,\theta_2) = \int \theta_1\,f(\theta_1 \mid y,\theta_2)\,d\theta_1.$$


Also, $f(\theta_1 \mid y,\theta_2)$ is called the conditional posterior density of $\theta_1$ and
may be obtained according to
$$f(\theta_1 \mid y,\theta_2) \propto f(\theta_1,\theta_2 \mid y). \quad (3.1)$$


Note: Equation (3.1) follows after first considering the equation
$f(\theta_1 \mid \theta_2) \propto f(\theta_1,\theta_2)$ and then conditioning throughout on y.


The main idea of Equation (3.1) is to examine the joint posterior density
$f(\theta_1,\theta_2 \mid y)$
(or any kernel thereof), think of all terms in this as constant except for
$\theta_1$, and then try to recognise a well-known density function of $\theta_1$.


This density function will define the conditional posterior distribution of
$\theta_1$, from which estimates such as the conditional posterior mean of $\theta_1$
(i.e. $E(\theta_1 \mid y,\theta_2)$) will hopefully be apparent.


One may also be interested in some function,
$$\psi = g(\theta_1,\theta_2),$$
of the two parameters (possibly of only one). Then advanced distribution
theory may be required to obtain the posterior pdf of $\psi$, i.e. $f(\psi \mid y)$.


This posterior density may then be used to calculate point and interval
estimates of $\psi$. For example, the posterior mean of $\psi$ is
$$\hat{\psi} = E(\psi \mid y) = \int \psi\,f(\psi \mid y)\,d\psi.$$


Alternatively, this mean may be obtained using the equation
$$\hat{\psi} = E(g(\theta_1,\theta_2) \mid y) = \iint g(\theta_1,\theta_2)\,f(\theta_1,\theta_2 \mid y)\,d\theta_1\,d\theta_2.$$


Further, one may be interested in predicting some other quantity x,
whose model distribution is specified in the form $f(x \mid y,\theta)$.


To obtain the posterior predictive density of x will generally require a
double integral (or summation) of the form
$$f(x \mid y) = \iint f(x \mid y,\theta_1,\theta_2)\,f(\theta_1,\theta_2 \mid y)\,d\theta_1\,d\theta_2.$$


Further integrations will then be required to produce point and interval
estimates, such as the predictive mean of x,
$$\hat{x} = E(x \mid y) = \int x\,f(x \mid y)\,dx.$$


Exercise 3.10 A bent coin which is tossed an unknown number 
of times 


Suppose that five heads have come up on an unknown number of tosses 
of a bent coin. 


Before the experiment, we believed the coin was going to be tossed a 
number of times equal to 1, 2, 3, ..., or 9, with all possibilities equally 
likely. As regards the probability of heads coming up on a single toss, 
we deemed no value more or less likely than any other value. We also 
considered the probability of heads as unrelated to the number of tosses. 


Find the marginal posterior distribution and mean of the number of 
tosses and of the probability of heads, respectively. Also find the number 
of heads we could expect to come up if the coin were to be tossed again 
the same number of times. 
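

Solution to Exercise 3.10


For this problem it is appropriate to consider the following three-level
hierarchical Bayesian model:
$$(y \mid \theta,n) \sim \text{Binomial}(n,\theta)$$
$$(\theta \mid n) \sim U(0,1)$$
$$n \sim DU(1,\ldots,k), \quad k = 9 \quad \text{(i.e. } f(n) = 1/9,\; n = 1,\ldots,9).$$


Under this model, the joint posterior density of the two parameters n and
$\theta$ is
$$f(n,\theta \mid y) \propto f(n,\theta)\,f(y \mid n,\theta) = f(n)\,f(\theta \mid n)\,f(y \mid n,\theta)$$
$$= \frac{1}{k} \times 1 \times \binom{n}{y}\theta^y(1-\theta)^{n-y}$$
$$\propto \binom{n}{y}\theta^y(1-\theta)^{n-y}, \quad 0 < \theta < 1, \quad n = y, y+1, \ldots, 9.$$


So the marginal posterior density of n is
$$f(n \mid y) = \int_0^1 f(n,\theta \mid y)\,d\theta$$
$$\propto \int_0^1 \binom{n}{y}\theta^y(1-\theta)^{n-y}\,d\theta, \quad n = y, y+1, \ldots, 9 \quad \text{(since } y = 0,\ldots,n)$$
$$= \binom{n}{y} B(y+1,n-y+1) \int_0^1 \frac{\theta^{y+1-1}(1-\theta)^{n-y+1-1}}{B(y+1,n-y+1)}\,d\theta, \quad n = 5,6,7,8,9$$
$$= \binom{n}{y} \frac{\Gamma(y+1)\Gamma(n-y+1)}{\Gamma(y+1+n-y+1)} \times 1 \quad \text{(since the integral equals 1)}$$
$$= \frac{n!}{y!(n-y)!} \times \frac{y!(n-y)!}{(n+1)!} = \frac{1}{n+1}$$
$$= \begin{cases} 1/6, & n = 5 \\ 1/7, & n = 6 \\ 1/8, & n = 7 \\ 1/9, & n = 8 \\ 1/10, & n = 9. \end{cases}$$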

After normalising (i.e. dividing each of these five numbers by their sum,
0.6456), we find that, to four decimals, n's posterior pdf is
$$f(n \mid y) = \begin{cases} 0.2581, & n = 5 \\ 0.2213, & n = 6 \\ 0.1936, & n = 7 \\ 0.1721, & n = 8 \\ 0.1549, & n = 9. \end{cases}$$


Thus, for example, there is a 17.2% chance a posteriori the coin was
tossed 8 times.


It follows that n's posterior mean is
$$\hat{n} = E(n \mid y) = \sum_{n=5}^{9} n\,f(n \mid y)$$
$$= 0.2581 \times 5 + 0.2213 \times 6 + \ldots + 0.1549 \times 9$$
$$= 6.744.$$


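Next, the marginal posterior density of $\theta$ is
$$f(\theta \mid y) = \sum_{n} f(n,\theta \mid y) \propto \sum_{n=y}^{9} \binom{n}{y}\theta^y(1-\theta)^{n-y}$$
$$= \sum_{n=y}^{9} \binom{n}{y} B(y+1,n-y+1)\,\frac{\theta^{y+1-1}(1-\theta)^{n-y+1-1}}{B(y+1,n-y+1)}$$
$$= \sum_{n=y}^{9} \frac{1}{n+1}\,f_{\text{Beta}(y+1,n-y+1)}(\theta).$$


Recall that $f(n \mid y) \propto 1/(n+1)$. It follows that $\theta$'s marginal posterior
density must be exactly
$$f(\theta \mid y) = \sum_{n=5}^{9} f(n \mid y)\,f_{\text{Beta}(y+1,n-y+1)}(\theta)$$
$$= 0.2581\,\frac{\theta^5(1-\theta)^{5-5}}{5!(5-5)!/(5+1)!} + \ldots + 0.1549\,\frac{\theta^5(1-\theta)^{9-5}}{5!(9-5)!/(9+1)!}.$$


We see that $\theta$'s posterior is a mixture of five beta distributions.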

Note: This result can also be obtained, more directly, as follows. By
considering the 'ordinary' binomial-beta model (from earlier), we see
that in the present context the conditional posterior distribution of $\theta$
(given n) is given by
$$(\theta \mid y,n) \sim \text{Beta}(y+1,n-y+1).$$
It immediately follows that
$$f(\theta \mid y) = \sum_{n} f(\theta,n \mid y) = \sum_{n} f(n \mid y)\,f(\theta \mid y,n) = \sum_{n=5}^{9} f(n \mid y)\,f_{\text{Beta}(y+1,n-y+1)}(\theta).$$


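We may now perform inference on $\theta$. The posterior mean of $\theta$ is
$$\hat{\theta} = E(\theta \mid y) = E\{E(\theta \mid y,n) \mid y\} = E\left\{\frac{y+1}{n+2} \;\Big|\; y\right\} = \sum_{n=5}^{9} \frac{y+1}{n+2}\,f(n \mid y)$$
$$= \frac{6}{7}(0.2581) + \frac{6}{8}(0.2213) + \frac{6}{9}(0.1936) + \frac{6}{10}(0.1721) + \frac{6}{11}(0.1549)$$
$$= 0.7040.$$


Figures 3.3 and 3.4 show the marginal posterior densities of n
and $\theta$, respectively, with the posterior means $\hat{n} = 6.744$ and $\hat{\theta} = 0.7040$
marked by vertical lines.


Finally, we consider x, the number of heads on the next n tosses.
The distribution of x is defined by $(x \mid y,n,\theta) \sim \text{Bin}(n,\theta)$.
So the posterior predictive mean of x is
$$\hat{x} = E(x \mid y) = E\{E(x \mid y,n,\theta) \mid y\} = E\{n\theta \mid y\}$$
$$= E\{E(n\theta \mid y,n) \mid y\} = E\{n\,E(\theta \mid y,n) \mid y\}$$
$$= E\left\{n \times \frac{y+1}{n+2} \;\Big|\; y\right\} = (y+1)\sum_{n=5}^{9} \frac{n}{n+2}\,f(n \mid y)$$
$$= 6\left[\frac{5}{7}(0.2581) + \frac{6}{8}(0.2213) + \frac{7}{9}(0.1936) + \frac{8}{10}(0.1721) + \frac{9}{11}(0.1549)\right]$$
$$= 4.592.$$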


Figure 3.3 Posterior density of n

[Figure not reproduced: f(n|y) plotted against n]


Figure 3.4 Posterior density of $\theta$

[Figure not reproduced: f(theta|y) plotted against theta]


R Code for Exercise 3.10 


y <- 5; k <- 9; options(digits=4)
nvec <- y:k; avec <- 1/(nvec+1); sumavec <- sum(avec); sumavec # 0.6456
fny <- avec/sumavec; rbind(nvec,avec,fny)
# nvec 5.0000 6.0000 7.0000 8.0000 9.0000
# avec 0.1667 0.1429 0.1250 0.1111 0.1000
# fny 0.2581 0.2213 0.1936 0.1721 0.1549
nhat <- sum(nvec*fny); nhat # 6.744
thhat <- sum( fny * (y+1)/(nvec+2) ); thhat # 0.704
xhat <- sum( fny * nvec * (y+1)/(nvec+2) ); xhat # 4.592
thvec <- seq(0,0.99,0.01); fthyvec <- thvec
for(i in 1:length(thvec)) fthyvec[i] <- sum( fny * dbeta(thvec[i],y+1,nvec-y+1) )


X11(w=8,h=4); par(mfrow=c(1,1))


plot(nvec,fny,type="n",xlab="n",ylab="f(n|y)", ylim=c(0,0.4))
points(nvec,fny,pch=16,cex=1); abline(v=nhat)


plot(thvec,fthyvec,type="n",xlab="theta",ylab="f(theta|y)",ylim=c(0,2.5))
lines(thvec,fthyvec,lwd=3); abline(v=thhat)
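
The three posterior means in Exercise 3.10 can also be checked by simulating from the joint posterior in stages: first n from $f(n \mid y)$, then $\theta$ from its conditional posterior given n, then x given both. This is only a sketch of that idea; the number of draws `J` and the names `ns`, `ths` and `xs` are illustrative.

```r
set.seed(1)
y <- 5; k <- 9
nvec <- y:k
fny <- (1 / (nvec + 1)) / sum(1 / (nvec + 1))   # marginal posterior of n

J <- 100000                                     # number of draws (arbitrary)

# Stage 1: draw n from f(n | y)
ns <- sample(nvec, J, replace = TRUE, prob = fny)
# Stage 2: draw theta from its conditional posterior, Beta(y+1, n-y+1)
ths <- rbeta(J, y + 1, ns - y + 1)
# Stage 3: draw the number of heads on the next n tosses
xs <- rbinom(J, size = ns, prob = ths)

c(nhat = mean(ns), thhat = mean(ths), xhat = mean(xs))
```

Up to Monte Carlo error, the three sample means should be close to 6.744, 0.704 and 4.592, respectively.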


Exercise 3.11 The uninformative normal-normal-gamma model


Consider the following Bayesian model:
$$(y_1,\ldots,y_n \mid \mu,\lambda) \sim \text{iid } N(\mu,1/\lambda)$$
$$(\mu \mid \lambda) \sim N(0,\infty)$$
$$\lambda \sim \text{Gamma}(0,0),$$
with observed data $y = (y_1,\ldots,y_n)$.


(a) Find the marginal posterior distribution of $\mu$.


(b) Find the marginal posterior distribution of $\lambda$.


(c) Find the posterior mean of the signal to noise ratio, defined as
$$\psi = \mu/\sigma = \mu\sqrt{\lambda}.$$


(d) Find the posterior predictive distribution of
$$\bar{x} = (x_1 + \ldots + x_m)/m,$$
where the $x_i$ values have a distribution given by
$$(x_1,\ldots,x_m \mid y,\mu,\lambda) \sim \text{iid } N(\mu,1/\lambda).$$


Note: Both $\mu$ and $\lambda$ are assigned uninformative priors. The joint prior
distribution of these two parameters could also be specified by:
$$f(\mu \mid \lambda) \propto 1, \quad \mu \in \mathbb{R}$$
$$f(\lambda) \propto 1/\lambda, \quad \lambda > 0,$$
or by the single statement
$$f(\mu,\lambda) \propto 1/\lambda, \quad \mu \in \mathbb{R}, \; \lambda > 0.$$
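

Solution to Exercise 3.11


(a) The joint posterior density of the two parameters $\mu$ and $\lambda$ is
$$f(\mu,\lambda \mid y) \propto f(\mu,\lambda)\,f(y \mid \mu,\lambda) = f(\lambda)\,f(\mu \mid \lambda)\,f(y \mid \mu,\lambda)$$
$$\propto \frac{1}{\lambda} \times 1 \times \prod_{i=1}^{n} \sqrt{\lambda}\,\exp\left\{-\frac{\lambda}{2}(y_i-\mu)^2\right\}$$
$$= \lambda^{\frac{n}{2}-1} \exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right\}.$$


So the marginal posterior density of $\mu$ is
$$f(\mu \mid y) = \int f(\mu,\lambda \mid y)\,d\lambda \propto \int_0^{\infty} \lambda^{\frac{n}{2}-1} \exp\left\{-\lambda\left[\frac{1}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right]\right\} d\lambda$$
$$= \frac{\Gamma(n/2)}{\left[\frac{1}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right]^{n/2}} \int_0^{\infty} \frac{\left[\frac{1}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right]^{n/2} \lambda^{\frac{n}{2}-1} \exp\left\{-\lambda\left[\frac{1}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right]\right\}}{\Gamma(n/2)}\,d\lambda$$
$$\propto \left[\sum_{i=1}^{n}(y_i-\mu)^2\right]^{-n/2} \times 1.$$


Note: The last integral is that of a gamma density and so is equal to 1.


Now observe that
$$\sum_{i=1}^{n}(y_i-\mu)^2 = \sum_{i=1}^{n}\{(y_i-\bar{y}) + (\bar{y}-\mu)\}^2$$
$$= \sum_{i=1}^{n}(y_i-\bar{y})^2 + 2(\bar{y}-\mu)\sum_{i=1}^{n}(y_i-\bar{y}) + n(\bar{y}-\mu)^2$$
$$= (n-1)s^2 + 2(\bar{y}-\mu)(n\bar{y} - n\bar{y}) + n(\bar{y}-\mu)^2$$
$$= (n-1)s^2 + n(\mu-\bar{y})^2, \quad \text{where } s^2 \text{ is the sample variance.}$$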
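
This result implies that
$$f(\mu \mid y) \propto \{(n-1)s^2 + n(\mu-\bar{y})^2\}^{-\frac{n}{2}} \propto \left\{1 + \frac{n(\mu-\bar{y})^2}{(n-1)s^2}\right\}^{-\frac{1}{2}((n-1)+1)}$$
$$= \left\{1 + \frac{1}{n-1}\left(\frac{\mu-\bar{y}}{s/\sqrt{n}}\right)^2\right\}^{-\frac{1}{2}((n-1)+1)}.$$


We now define
$$r = \frac{\mu-\bar{y}}{s/\sqrt{n}}, \quad \text{so that} \quad \mu = \bar{y} + \frac{s}{\sqrt{n}}r \quad \text{and} \quad \frac{d\mu}{dr} = \frac{s}{\sqrt{n}}.$$
By the transformation rule, we then have that
$$f(r \mid y) = f(\mu \mid y)\left|\frac{d\mu}{dr}\right| \propto \left\{1 + \frac{r^2}{n-1}\right\}^{-\frac{1}{2}((n-1)+1)} \times \frac{s}{\sqrt{n}} \propto \left\{1 + \frac{r^2}{n-1}\right\}^{-\frac{1}{2}((n-1)+1)}.$$


By definition of the t distribution, we see that $(r \mid y) \sim t(n-1)$.


It follows that the marginal posterior distribution of $\mu$ is given by
$$\left(\frac{\mu-\bar{y}}{s/\sqrt{n}} \;\Big|\; y\right) \sim t(n-1). \quad (3.2)$$


Note 1: In result (3.2), the data vector y appears only by way of the
sample mean $\bar{y}$ and sample standard deviation s. So it is also true that
$$\left(\frac{\mu-\bar{y}}{s/\sqrt{n}} \;\Big|\; \bar{y},s\right) \sim t(n-1).$$


Here, s may not be left out of the conditioning. So it is not true that
$$\left(\frac{\mu-\bar{y}}{s/\sqrt{n}} \;\Big|\; \bar{y}\right) \sim t(n-1).$$


Note 2: Result (3.2) implies that the marginal posterior mean, mode and
median of $\mu$ are all equal to $\bar{y}$, and the $1-\alpha$ CPDR/HPDR for $\mu$ is
$$\left(\bar{y} \pm t_{\alpha/2}(n-1)\,s/\sqrt{n}\right).$$


This inference is identical to that obtained via the classical approach and
thereby justifies the use of the joint prior
$$f(\mu,\lambda) \propto 1/\lambda, \quad \mu \in \mathbb{R}, \; \lambda > 0$$
in cases of a priori ignorance regarding both $\mu$ and $\lambda$.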


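Note 3: The exact marginal posterior density of $\mu$ is
$$f(\mu \mid y) = f(r \mid y)\left|\frac{dr}{d\mu}\right|, \quad \text{where } r = \frac{\mu-\bar{y}}{s/\sqrt{n}} \text{ and } (r \mid y) \sim t(n-1)$$
$$= \frac{\Gamma\left(\frac{1}{2}((n-1)+1)\right)}{\sqrt{(n-1)\pi}\;\Gamma((n-1)/2)} \left\{1 + \frac{1}{n-1}\left(\frac{\mu-\bar{y}}{s/\sqrt{n}}\right)^2\right\}^{-\frac{1}{2}((n-1)+1)} \times \frac{\sqrt{n}}{s}, \quad \mu \in \mathbb{R}.$$


This density can be calculated in R at any point $\mu$ by first calculating
the corresponding value of r and then returning
dt(r,n-1)*sqrt(n)/s
(see below for examples).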


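(b) The marginal posterior density of $\lambda$ is
$$f(\lambda \mid y) = \int f(\mu,\lambda \mid y)\,d\mu \propto \int_{-\infty}^{\infty} \lambda^{\frac{n}{2}-1} \exp\left\{-\frac{\lambda}{2}\left[(n-1)s^2 + n(\mu-\bar{y})^2\right]\right\} d\mu$$
$$= \lambda^{\frac{n}{2}-1} \exp\left\{-\frac{\lambda}{2}(n-1)s^2\right\} \sqrt{\frac{2\pi}{n\lambda}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi/(n\lambda)}} \exp\left\{-\frac{(\mu-\bar{y})^2}{2/(n\lambda)}\right\} d\mu$$
$$\propto \lambda^{\frac{n-1}{2}-1} \exp\left\{-\lambda\,\frac{(n-1)s^2}{2}\right\}.$$

Note: The last integral is that of a normal density and so equals 1.


It follows that
$$(\lambda \mid y) \sim \text{Gamma}\left(\frac{n-1}{2},\; \frac{n-1}{2}s^2\right), \quad (3.3)$$
and hence also that
$$((n-1)s^2\lambda \mid y) \sim \chi^2(n-1). \quad (3.4)$$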


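Note 1: Result (3.4) can be proved as follows. Let
$$u = (n-1)s^2\lambda, \quad \text{so that} \quad \lambda = \frac{u}{(n-1)s^2} \quad \text{and} \quad \frac{d\lambda}{du} = \frac{1}{(n-1)s^2}.$$


Then, by the transformation rule,
$$f(u \mid y) = f(\lambda \mid y)\left|\frac{d\lambda}{du}\right| \propto \left(\frac{u}{(n-1)s^2}\right)^{\frac{n-1}{2}-1} \exp\left\{-\frac{u}{(n-1)s^2} \cdot \frac{(n-1)s^2}{2}\right\} \times \frac{1}{(n-1)s^2}$$
$$\propto u^{\frac{n-1}{2}-1} e^{-u/2}.$$


Thus $(u \mid y) \sim \text{Gamma}\left(\frac{n-1}{2}, \frac{1}{2}\right) \sim \chi^2(n-1)$, which confirms (3.4).


Note 2: Results (3.3) and (3.4) imply that $\lambda$ has posterior mean $1/s^2$.
This makes sense because $\lambda = 1/\sigma^2$, and $s^2$ is an unbiased estimator of
$\sigma^2$. We see that the inverse of the posterior mean of $\lambda$ provides us with
the classical estimator of $\sigma^2$.


Also, result (3.4) implies that the $1-\alpha$ CPDR for $\lambda$ is
$$\left(\frac{\chi^2_{1-\alpha/2}(n-1)}{(n-1)s^2},\; \frac{\chi^2_{\alpha/2}(n-1)}{(n-1)s^2}\right),$$
where $\chi^2_p(n-1)$ denotes the upper p-quantile of the $\chi^2(n-1)$ distribution.


It follows that the $1-\alpha$ CPDR for $\sigma^2 = 1/\lambda$ is
$$\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2}(n-1)},\; \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}(n-1)}\right).$$


It will be observed that this is exactly the same as the usual classical
$1-\alpha$ CI for $\sigma^2$ when the normal mean $\mu$ is unknown.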
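
As a small illustration of result (3.4), the central posterior density regions for the precision and the variance can be computed in R with `qchisq()` or, equivalently, `qgamma()`. This is a sketch only: the values n = 5 and s = 2 are illustrative, and note that `qchisq()` takes lower-tail probabilities.

```r
n <- 5; s <- 2; alpha <- 0.05        # illustrative values

# CPDR for lambda from the gamma posterior (3.3), using shape and rate
lam_cpdr <- qgamma(c(alpha / 2, 1 - alpha / 2),
                   shape = (n - 1) / 2, rate = (n - 1) * s^2 / 2)

# The same interval via the chi-squared result (3.4)
lam_cpdr_chisq <- qchisq(c(alpha / 2, 1 - alpha / 2), df = n - 1) / ((n - 1) * s^2)

# CPDR for sigma^2 = 1/lambda: invert the endpoints and reverse their order
sig2_cpdr <- rev(1 / lam_cpdr)

rbind(lam_cpdr, lam_cpdr_chisq, sig2_cpdr)
```

The first two rows should coincide, and the interval for the variance should contain the sample variance.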


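(c) The posterior mean of $\psi = \mu/\sigma = \mu\sqrt{\lambda}$ could be calculated using the
equation
$$\hat{\psi} = \int \psi\,f(\psi \mid y)\,d\psi,$$
where $f(\psi \mid y)$ is the posterior density of $\psi$.


However, obtaining this density may be difficult. We could use Jacobian
theory to find the joint posterior density of $\mu$ and $\psi$, and then integrate
that joint density with respect to $\mu$. The result would be $f(\psi \mid y)$.


Another approach is to calculate the mean as
$$\hat{\psi} = E(\mu\sqrt{\lambda} \mid y) = \int_{\mu=-\infty}^{\infty}\int_{\lambda=0}^{\infty} \mu\sqrt{\lambda}\,f(\mu,\lambda \mid y)\,d\lambda\,d\mu,$$
where:
$$f(\mu,\lambda \mid y) = \frac{k(\mu,\lambda)}{c}$$
$$k(\mu,\lambda) = \lambda^{\frac{n}{2}-1} \exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right\}$$
$$c = \int_{\mu=-\infty}^{\infty}\int_{\lambda=0}^{\infty} k(\mu,\lambda)\,d\lambda\,d\mu.$$


More simply, we may use the law of iterated expectation to write
$$\hat{\psi} = E(\mu\sqrt{\lambda} \mid y) = E\{E(\mu\sqrt{\lambda} \mid y,\lambda) \mid y\} = E\{\sqrt{\lambda}\,E(\mu \mid y,\lambda) \mid y\}$$
$$= E\{\sqrt{\lambda}\,\bar{y} \mid y\} = \bar{y}\,E(\lambda^{1/2} \mid y)$$
$$= \bar{y}\,\frac{\Gamma\left(\frac{n-1}{2} + \frac{1}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)\left(\frac{n-1}{2}s^2\right)^{1/2}} = c_n\frac{\bar{y}}{s},$$
where
$$c_n = \sqrt{\frac{2}{n-1}}\;\frac{\Gamma(n/2)}{\Gamma((n-1)/2)}.$$


Note 1: By a well-known property of the gamma function, $c_n \to 1$ as
$n \to \infty$. So for large n the posterior mean of $\psi = \mu/\sigma$ is approximately
the same as $\psi$'s MLE, $\bar{y}/s$.


Note 2: Suppose that we wish to find the posterior median or mode of $\psi$
or the 95% CPDR or HPDR for that quantity. Then we first need to
determine $f(\psi \mid y)$. This and subsequent calculations may be difficult.


This points to the need for another strategy. As will be seen later, most
of these issues can be easily sidestepped using Monte Carlo methods.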
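
As a preview of the Monte Carlo idea mentioned above, the following sketch simulates the signal to noise ratio directly. It assumes the conditional posterior of the mean given the precision is N(ybar, 1/(n*lambda)), which is what the flat prior on the mean implies, and uses the illustrative values n = 5, ybar = 10 and s = 2. The constant `cn` is the gamma-function ratio implied by the gamma posterior (3.3); the names `J`, `psi` and `cn` are illustrative.

```r
set.seed(1)
n <- 5; ybar <- 10; s <- 2          # illustrative values
J <- 100000                         # number of Monte Carlo draws (arbitrary)

# Draw lambda from its marginal posterior (3.3), then mu given lambda
lambda <- rgamma(J, shape = (n - 1) / 2, rate = (n - 1) * s^2 / 2)
mu <- rnorm(J, mean = ybar, sd = 1 / sqrt(n * lambda))

# Signal to noise ratio for each draw
psi <- mu * sqrt(lambda)

# Closed-form posterior mean: ybar * E(sqrt(lambda) | y) = cn * ybar / s
cn <- sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)

c(mc_mean = mean(psi), exact_mean = cn * ybar / s)
quantile(psi, c(0.025, 0.5, 0.975))  # Monte Carlo median and 95% CPDR estimate
```

The same draws give estimates of the posterior median and central 95% region of the ratio without any further analytical work.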


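(d) Recall from previous exercises that:
$$(\bar{x} \mid y,\lambda) \sim N\left(\bar{y},\; \frac{1}{m\lambda} + \frac{1}{n\lambda}\right) \sim N\left(\bar{y},\; \frac{n+m}{nm\lambda}\right)$$
$$(\lambda \mid y) \sim \text{Gamma}\left(\frac{n-1}{2},\; \frac{n-1}{2}s^2\right).$$


Hence
$$f(\bar{x} \mid y) = \int f(\bar{x} \mid y,\lambda)\,f(\lambda \mid y)\,d\lambda$$
$$\propto \int_0^{\infty} \lambda^{\frac{1}{2}} \exp\left\{-\frac{nm\lambda(\bar{x}-\bar{y})^2}{2(n+m)}\right\} \times \lambda^{\frac{n-1}{2}-1} \exp\left\{-\lambda\,\frac{n-1}{2}s^2\right\} d\lambda$$
$$= \int_0^{\infty} \lambda^{\frac{n}{2}-1} \exp\left\{-\frac{\lambda}{2}\left[\frac{nm(\bar{x}-\bar{y})^2}{n+m} + (n-1)s^2\right]\right\} d\lambda$$
$$\propto \left[\frac{nm(\bar{x}-\bar{y})^2}{n+m} + (n-1)s^2\right]^{-\frac{n}{2}} \propto \left[1 + \frac{nm(\bar{x}-\bar{y})^2}{(n-1)(n+m)s^2}\right]^{-\frac{n}{2}}$$
$$= \left[1 + \frac{1}{n-1}\left(\frac{\bar{x}-\bar{y}}{(s/\sqrt{n})\sqrt{(n+m)/m}}\right)^2\right]^{-\frac{1}{2}((n-1)+1)}.$$


It follows that
$$\left(\frac{\bar{x}-\bar{y}}{(s/\sqrt{n})\sqrt{(n+m)/m}} \;\Big|\; y\right) \sim t(n-1). \quad (3.5)$$


Note 1: Equation (3.5) can be used to derive the predictive distribution
of the average of all n + m values considered (both past and future).


That average may be written
$$a = \frac{y_1 + \ldots + y_n + x_1 + \ldots + x_m}{n+m} = \frac{n\bar{y} + m\bar{x}}{n+m}.$$
Consequently,
$$\bar{x} = \frac{(n+m)a - n\bar{y}}{m}.$$


It follows that in (3.5),
$$\frac{\bar{x}-\bar{y}}{(s/\sqrt{n})\sqrt{(n+m)/m}} = \frac{\dfrac{(n+m)a - n\bar{y}}{m} - \bar{y}}{(s/\sqrt{n})\sqrt{(n+m)/m}} = \frac{(a-\bar{y})(n+m)/m}{(s/\sqrt{n})\sqrt{(n+m)/m}} = \frac{a-\bar{y}}{(s/\sqrt{n})\sqrt{m/(n+m)}},$$
and therefore
$$\left(\frac{a-\bar{y}}{(s/\sqrt{n})\sqrt{m/(n+m)}} \;\Big|\; y\right) \sim t(n-1). \quad (3.6)$$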

This may look familiar to some readers, the reason being as follows.
Denote the total number of values, n + m, as N, and write the average of
all these observations, a, as $\bar{Y}$. Then (3.6) is equivalent to the result
$$\left(\frac{\bar{Y}-\bar{y}}{(s/\sqrt{n})\sqrt{1-n/N}} \;\Big|\; y\right) \sim t(n-1). \quad (3.7)$$


So the posterior predictive mean of $\bar{Y}$ is the observed sample mean $\bar{y}$,
and the 95% central (and highest) predictive density region for $\bar{Y}$ is
$$\left(\bar{y} \pm t_{0.025}(n-1)\,\frac{s}{\sqrt{n}}\sqrt{1-\frac{n}{N}}\right). \quad (3.8)$$


It will be noted that this inference is exactly the same as implied by the
standard approach in the classical survey sampling framework (e.g. see
Cochran, 1977).


Recall that in this framework, $\sqrt{1-n/N}$ is the finite population
correction (fpc) factor. As N increases, the fpc factor tends to 1 and (3.8)
reduces to
$$\left(\bar{y} \pm t_{0.025}(n-1)\,\frac{s}{\sqrt{n}}\right),$$
which is the 'standard' CI for a normal mean when the normal variance
is unknown.


We have here touched on the topic of Bayesian finite population
inference. More will be said on this topic later in the book.


Note 2: The exact posterior predictive density of the finite population
mean $\bar{Y}$ may be obtained according to
$$f(\bar{Y} \mid y) = f(q \mid y)\left|\frac{dq}{d\bar{Y}}\right|,$$
where:
$$q = \frac{\bar{Y}-\bar{y}}{(s/\sqrt{n})\sqrt{1-n/N}}, \quad (q \mid y) \sim t(n-1).$$


We thereby obtain the density
$$f(\bar{Y} \mid y) = \frac{\Gamma\left(\frac{1}{2}((n-1)+1)\right)}{\sqrt{(n-1)\pi}\;\Gamma((n-1)/2)} \left\{1 + \frac{1}{n-1}\left(\frac{\bar{Y}-\bar{y}}{(s/\sqrt{n})\sqrt{1-n/N}}\right)^2\right\}^{-\frac{1}{2}((n-1)+1)} \times \frac{\sqrt{n}}{s\sqrt{1-n/N}}, \quad \bar{Y} \in \mathbb{R}.$$


This density can be calculated in R at any point $\bar{Y}$ by first calculating
the corresponding value of q (as defined above) and then returning
dt(q,n-1)*sqrt(n)/(s*sqrt(1-n/N))
(see below for an example).


Note 3: The posterior predictive density of $\bar{Y}$ converges to the marginal
posterior density of $\mu$ as N tends to infinity with n fixed. That is,
$$f(\bar{Y} = c \mid y) \to f(\mu = c \mid y) \quad \text{as } N \to \infty.$$


This is on account of the fpc factor $\sqrt{1-n/N}$ converging to unity. Thus
$\mu$ may be interpreted as the average of a hypothetically infinite number
of values from the underlying superpopulation, $N(\mu,1/\lambda)$.


Figure 3.5 shows the predictive density $f(\bar{Y} \mid y)$ for various values of N,
as well as the posterior density $f(\mu \mid y)$, corresponding to the limiting
case $N = \infty$. In each case, the values of n, $\bar{y}$ and s are (arbitrarily) taken
as 5, 10 and 2, respectively. Note that $N = \infty \Leftrightarrow m = \infty$ since m = N - n.


Note 4: Consider the following Bayesian model:
$$(y_1,\ldots,y_n \mid \mu,\lambda) \sim \text{iid } N(\mu,1/\lambda)$$
$$(\mu \mid \lambda) \sim N(\mu_0,\sigma_0^2)$$
$$\lambda \sim \text{Gamma}(\alpha,\beta),$$
where $\sigma_0$ is not necessarily $\infty$ and $\alpha$ and $\beta$ are not necessarily 0.


This may be called the (general) normal-normal-gamma model, as
distinct from the uninformative normal-normal-gamma model, here in
Exercise 3.11. In the general model, the inferences typically required are
much more difficult to perform. Later in the book, it will be shown how
to proceed in this and similarly difficult situations using Monte Carlo
methods, including Markov chain Monte Carlo (MCMC) methods.


Figure 3.5 Predictive density of the finite population mean
(See Note 3)

[Figure not reproduced: f(Ybar|y) plotted against Ybar for N=6 (m=1), N=7 (m=2), N=10 (m=5), N=40 (m=35) and N=infinity (=m). The solid line (N=infinity) is also the posterior density of mu, namely f(mu|y).]



R Code for Exercise 3.11 


X11(w=8,h=6); par(mfrow=c(1,1))
ybar=10; s=2; cv=seq(0,20,0.005)
plot(c(4,16),c(0,1),type="n",xlab="Ybar", ylab="f( Ybar | y )", main=" ")
n=5; rv=(cv-ybar)*sqrt(n)/s; lines(cv, dt(rv,n-1)*sqrt(n)/s,lty=1,lwd=2)
Nvec=c(6,7,10,40)
for(i in 1:length(Nvec)){ N=Nvec[i]; qv=rv/sqrt(1-n/N)
lines(cv, dt(qv,n-1)*sqrt(n)/(s*sqrt(1-n/N)),lty=i+1,lwd=2) }
legend(4,1,
c("N=6 (m=1)","N=7 (m=2)","N=10 (m=5)","N=40 (m=35)","N=infinity (=m)"),
lty=c(2:5,1),lwd=2)
text(6,0.6,
"The solid line is also the\nposterior density of mu,\nnamely f( mu | y ).")


4.1 Solving equations


In most of the Bayesian models so far examined, the calculations required
could be done analytically. For example, the model given by:
$$(Y \mid \theta) \sim \text{Binomial}(5,\theta)$$
$$\theta \sim U(0,1),$$
together with data y = 5, implies the posterior $(\theta \mid y) \sim \text{Beta}(6,1)$. So $\theta$
has posterior pdf $f(\theta \mid y) = 6\theta^5$ and posterior cdf $F(\theta \mid y) = \theta^6$. Then,
setting $F(\theta \mid y) = 1/2$ yields the posterior median, $\theta = 1/2^{1/6} = 0.8909$.


But what if the equation $F(\theta \mid y) = 1/2$ were not so easy to solve? In that
case we could employ a number of strategies. One of these is trial and
error, and another is via special functions in software packages, for
example using the qbeta() function in R. This yields the correct answer.
Yet another method is the Newton-Raphson algorithm, our next topic.


R Code for Section 4.1 


qbeta(0.5,6,1) # 0.8908987


4.2 The Newton-Raphson algorithm


The Newton-Raphson (NR) algorithm is a useful technique for solving
equations of the form g(x) = 0.


This algorithm involves choosing a suitable starting value $x_0$ and
iteratively applying the equation
$$x_{j+1} = x_j - \frac{g(x_j)}{g'(x_j)}$$
until convergence has been achieved to a desired degree of precision.


How does the NR algorithm work? Figure 4.1 illustrates the idea. 


Figure 4.1 The Newton-Raphson algorithm

[Figure not reproduced: plot of g(x) against x, with the tangent at point Q]


Here, a is the desired solution of the equation g(x) = 0, c is a guess at that 
solution, and b is a better estimate of a. Observe that the slope of the 


tangent at point Q is equal to both g'(c) and g(c)/(c - b). Equating these
two expressions we get b = c - g(c)/g'(c).


Note: Sometimes the NR algorithm takes a long time to converge, and 
sometimes it converges to the wrong or even impossible value or gets 
‘stuck’ and fails to converge at all. This is a general problem with the 
NR algorithm, namely its instability and the need to start it off with an 
initial guess that is sufficiently close to the desired solution. 
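
The iteration and the caveats in the note above can be sketched as a general-purpose R function. This is a minimal illustration rather than a definitive implementation: the function name `newton_raphson`, the stopping rule (step size below `tol`) and the iteration cap `maxit` are choices made here, not taken from the text. It is applied to the equation $\theta^6 - 1/2 = 0$ from Section 4.1.

```r
# Newton-Raphson for solving g(x) = 0, given g and its derivative dg.
# Stops when the step size falls below tol, or after maxit iterations.
newton_raphson <- function(g, dg, x0, tol = 1e-10, maxit = 100) {
  x <- x0
  for (j in seq_len(maxit)) {
    step <- g(x) / dg(x)
    # Guard against a zero derivative or a numerical overflow
    if (!is.finite(step)) stop("Newton-Raphson step is not finite")
    x <- x - step
    if (abs(step) < tol) {
      return(list(root = x, iterations = j, converged = TRUE))
    }
  }
  # Reaching here means the iteration cap was hit without convergence
  list(root = x, iterations = maxit, converged = FALSE)
}

# Example: posterior median when F(theta | y) = theta^6, starting from theta = 1
res <- newton_raphson(function(th) th^6 - 0.5, function(th) 6 * th^5, x0 = 1)
res$root
```

Returning the `converged` flag alongside the root makes it easy to detect the failure modes described in the note, and trying several starting values is a simple further safeguard.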


Exercise 4.1 Calculating a posterior median via the Newton- 
Raphson algorithm 


Suppose that the posterior cdf of a parameter is $F(\theta \mid y) = \theta^6$.


Find the posterior median by solving the equation $F(\theta \mid y) = 1/2$
via the Newton-Raphson algorithm.


Note: The algorithm should converge to the analytical solution, namely
$1/2^{1/6} = 0.8909$.


Solution to Exercise 4.1 


We wish to solve $g(\theta) = 0$, where $g(\theta) = F(\theta \mid y) - 1/2$.


Here, $g'(\theta) = f(\theta \mid y) - 0$, where $f(\theta \mid y) = 6\theta^5$.

So the algorithm is given by
$$\theta_{j+1} = \theta_j - \frac{g(\theta_j)}{g'(\theta_j)} = \theta_j - \frac{F(\theta_j \mid y) - 1/2}{f(\theta_j \mid y)} = \theta_j - \frac{\theta_j^6 - 1/2}{6\theta_j^5}.$$


Starting at the posterior mode, $\theta_0 = 1$ (chosen arbitrarily), we get the
sequence shown in Table 4.1.


Table 4.1 NR algorithm starting from 1


j          0       1       2       3       4
theta_j    1.0000  0.9167  0.8926  0.8909  0.8909


So the posterior median is 0.8909. The same result is obtained if we start
with $\theta_0 = 0.8$, as shown in Table 4.2.


Table 4.2 NR algorithm starting from 0.8


j          0       1       2       3       4
theta_j    0.8000  0.9210  0.8933  0.8909  0.8909


Note 1: The median must satisfy
$$\theta = \theta - \frac{\theta^6 - 1/2}{6\theta^5}.$$


This equation is indeed satisfied at the solution $\theta = 0.8909$ (working
to four decimals). This illustrates how to check whether or not the NR
algorithm has converged properly.


Note 2: In this simple example, one could get the answer by solving the 
equation 0 = 0 — g(0)/ g'(0) analytically. In general, that won't be 
possible, and iterating the algorithm will be required. Of course, if it is 
possible to solve that equation analytically, there is no need to iterate. 
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R Code for Exercise 4.1 


NR <- function(th,J=5){
# This function performs the Newton-Raphson algorithm for J iterations
# after starting at the value th. It outputs a vector of th values of length J+1.
thvec <- th; for(j in 1:J){
num <- th^6-1/2 # theta's posterior cdf minus 1/2 (numerator)
den <- 6*th^5 # theta's posterior pdf (denominator)
th <- th - num/den
thvec <- c(thvec,th) }
thvec }

options(digits=4)

NR(th=1,J=6) # 1.0000 0.9167 0.8926 0.8909 0.8909 0.8909 0.8909
NR(th=0.8,J=6) # 0.8000 0.9210 0.8933 0.8909 0.8909 0.8909 0.8909
0.8909-(0.8909^6-0.5)/(6*0.8909^5) # 0.8909 (Check)
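
As an independent check on the NR result, the same equation can be handed to base R's `uniroot()`, which brackets the root instead of using derivatives. This is a sketch added for illustration and is not part of the original solution; the bracketing interval (0, 1) is an assumption based on $g(0) < 0 < g(1)$.

```r
# g(theta) = F(theta | y) - 1/2 for the posterior cdf theta^6
g <- function(th) th^6 - 1/2

# Bracketing root search on (0, 1); g changes sign on this interval
check <- uniroot(g, interval = c(0, 1), tol = 1e-10)

check$root      # numerical root
2^(-1/6)        # analytical solution quoted in the exercise
```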


Exercise 4.2 Further practice with the NR algorithm 

Use the Newton-Raphson algorithm to solve the equation $t^2 = e^t$.
Note: In this case there is no analytical solution. 

Solution to Exercise 4.2 


We wish to solve $g(t) = 0$, where $g(t) = t^2 - e^t$. Now, $g'(t) = 2t - e^t$.
So we iterate according to
$$t_{j+1} = t_j - \frac{t_j^2 - e^{t_j}}{2t_j - e^{t_j}}.$$

Let us arbitrarily choose $t_0 = 0$. Then we get:
$$t_1 = 0 - \frac{0^2 - e^{0}}{2(0) - e^{0}} = -1.000000$$
$$t_2 = (-1) - \frac{(-1)^2 - e^{-1}}{2(-1) - e^{-1}} = -0.733044$$
$$t_3 = (-0.733044) - \frac{(-0.733044)^2 - e^{-0.733044}}{2(-0.733044) - e^{-0.733044}} = -0.703808$$
$$t_4 = (-0.703808) - \frac{(-0.703808)^2 - e^{-0.703808}}{2(-0.703808) - e^{-0.703808}} = -0.703467$$
$$t_5 = (-0.703467) - \frac{(-0.703467)^2 - e^{-0.703467}}{2(-0.703467) - e^{-0.703467}} = -0.703467, \text{ etc.}$$

Thus the output of the NR algorithm starting from 0 is:
0.000000, -1.000000, -0.733044, -0.703808, -0.703467, -0.703467, 
-0.703467, -0.703467, ..... 


Also, we find that the output of the NR algorithm starting from 1 is: 
1.000000, -1.392211, -0.835088, -0.709834, -0.703483, -0.703467, 
-0.703467, -0.703467, ..... 


From these results we feel confident that the required solution to 6
decimals is -0.703467. As a check, we calculate
$$g(-0.703467) = (-0.703467)^2 - e^{-0.703467} = -0.000000803508 \approx 0.$$


Figure 4.2 illustrates the function g and the output of the NR algorithm
starting from -5, which is:
-5.000000, -2.502357, -1.287421, -0.802834, -0.707162, -0.703473,
-0.703467, -0.703467, .....

Figure 4.2 Solution via the NR algorithm starting at -5


R Code for Exercise 4.2

options(digits=6); t=0; tv=t; for(j in 1:7){ t=t-(t^2-exp(t))/(2*t-exp(t))
tv=c(tv,t) }; tv
# 0.000000 -1.000000 -0.733044 -0.703808 -0.703467 -0.703467 -0.703467
# -0.703467
# Check:
t^2-exp(t) # 0
(-0.703467)^2-exp(-0.703467) # -8.03508e-07

t=1; tv=t; for(j in 1:7){ t=t-(t^2-exp(t))/(2*t-exp(t)); tv=c(tv,t) }; tv
# 1.000000 -1.392211 -0.835088 -0.709834 -0.703483 -0.703467 -0.703467
# -0.703467

t=-5; tv=t; for(j in 1:7){ t=t-(t^2-exp(t))/(2*t-exp(t)); tv=c(tv,t) }; tv
# -5.000000 -2.502357 -1.287421 -0.802834 -0.707162 -0.703473 -0.703467
# -0.703467

tvec=seq(-6,2,0.01); gvec= tvec^2-exp(tvec)
X11(w=8,h=4.5); par(mfrow=c(1,1))
plot(tvec,gvec,type="l",lwd=2,xlab="t",ylab="g(t)", main=" ")
abline(h=0,v=t); points(tv, tv^2-exp(tv),pch=16)
text( tv[1:4], tv[1:4]^2-exp(tv[1:4]) +3, 0:3)
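
The runs above use a fixed number of iterations. A variation, sketched here under the assumption that a stopping tolerance of 1e-8 is acceptable, stops as soon as successive iterates agree and reports how many iterations each starting value needed. The helper name `run_nr` is hypothetical.

```r
g  <- function(t) t^2 - exp(t)     # function whose root is required
gp <- function(t) 2 * t - exp(t)   # its derivative

# Iterate NR from a starting value until successive values agree to tol
run_nr <- function(t, tol = 1e-8, max_iter = 50) {
  for (j in 1:max_iter) {
    t_new <- t - g(t) / gp(t)
    if (abs(t_new - t) < tol) return(c(root = t_new, iterations = j))
    t <- t_new
  }
  c(root = NA, iterations = max_iter)   # signals failure to converge
}

starts <- c(0, 1, -5)                  # the three starting values used above
out <- sapply(starts, run_nr)          # one column per starting value
out
```

Each column of `out` corresponds to one starting value, so the effect of the initial guess on the number of iterations can be read off directly.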


Exercise 4.3 Another example of the NR algorithm 


Consider the Bayesian model:
$$(x \mid p) \sim \text{Bin}(3, p)$$
$$p \sim U(0, 1),$$
and suppose the observed value of x is 2. Find the posterior median of p.


Solution to Exercise 4.3 


The posterior distribution of p is given by
$$(p \mid x) \sim \text{Beta}(1 + 2, 1 + 1),$$
with density
$$f(p \mid x) = \frac{p^{3-1}(1-p)^{2-1}}{\Gamma(3)\Gamma(2)/\Gamma(5)} = 12p^2(1 - p), \quad 0 < p < 1.$$
So, the posterior cdf is
$$F(p \mid x) = \int_0^p 12q^2(1 - q)\, dq = 12\left(\frac{p^3}{3} - \frac{p^4}{4}\right) = 4p^3 - 3p^4, \quad 0 < p < 1.$$

To find the posterior median of p we need to solve $F(p \mid x) = 1/2$, or
equivalently $g(p) = 0$, where $g(p) = F(p \mid x) - 1/2 = 4p^3 - 3p^4 - 1/2$.

Now, $g'(p) = 12p^2 - 12p^3$. So the NR algorithm is defined by iterating
$$p_{j+1} = p_j - \frac{g(p_j)}{g'(p_j)} = p_j - \frac{4p_j^3 - 3p_j^4 - 1/2}{12p_j^2 - 12p_j^3}.$$

What's a good starting value here? Let's try the MLE, $p_0 = 2/3$.


Using this value, we get:
0.666667, 0.614583, 0.614272, 0.614272, 0.614272, 0.614272,
0.614272, 0.614272, .....

Starting at other values (0.5, 0.9 and 0.1), we get the following (three)
sequences (respectively):
0.500000, 0.625000, 0.614306, 0.614272, 0.614272, 0.614272,
0.614272, 0.614272, .....

0.900000, 0.439403, 0.649191, 0.614501, 0.614272, 0.614272,
0.614272, 0.614272, .....

0.10000, 4.69537, 3.62690, 2.83403, 2.25146, 1.83195, 1.54254,
1.36156, .....

The last sequence does not seem to have converged. Let's run this for a
bit longer. The result is:
0.10000, 4.69537, 3.62690, 2.83403, 2.25146, 1.83195, 1.54254,
1.36156, 1.27282, 1.24913, 1.24749, 1.24748, 1.24748, 1.24748,
1.24748, 1.24748, 1.24748, 1.24748, 1.24748, 1.24748, .....


Thus if we start at 0.1, the algorithm converges to an impossible value of 
p, namely 1.24748. 


It appears that the required posterior median is 0.61427. As a check we
may calculate
$$F(p = 0.61427 \mid x) = 4(0.61427)^3 - 3(0.61427)^4 = 0.499999 \approx 0.5.$$

Figures 4.3 and 4.4 show the posterior median 0.61427, as well as the
other solution of $g(p) = 0$ (i.e. root of $g$), namely 1.24748. This is not
actually a solution of $F(p \mid x) = 0.5$, because the values of $F(p \mid x)$ for
$p < 0$ and $p > 1$ are 0 and 1, respectively.

Thus, the definition of $g$ above is ‘deceptive’, and a better definition is:
$$g(p) = F(p \mid x) - 1/2 = \begin{cases} 0 - 1/2 = -1/2, & p < 0 \\ 4p^3 - 3p^4 - 1/2, & 0 \le p \le 1 \\ 1 - 1/2 = 1/2, & p > 1. \end{cases}$$

Figure 4.3 Posterior cdf and median of p


R Code for Exercise 4.3

options(digits=6); p=2/3; pv=p; for(j in 1:7){
p = p - (4*p^3-3*p^4-1/2)/(12*p^2-12*p^3); pv=c(pv,p) }; pv
# 0.666667 0.614583 0.614272 0.614272 0.614272 0.614272 0.614272
# 0.614272

p=0.5; pv=p; for(j in 1:7){ p = p - (4*p^3-3*p^4-1/2)/(12*p^2-12*p^3);
pv=c(pv,p) }; pv # 0.500000 0.625000 0.614306 0.614272 0.614272 0.614272
# 0.614272 0.614272

p=0.9; pv=p; for(j in 1:7){ p = p - (4*p^3-3*p^4-1/2)/(12*p^2-12*p^3);
pv=c(pv,p) }; pv # 0.900000 0.439403 0.649191 0.614501 0.614272 0.614272
# 0.614272 0.614272

p=0.1; pv=p; for(j in 1:7){ p = p - (4*p^3-3*p^4-1/2)/(12*p^2-12*p^3);
pv=c(pv,p) }; pv
# 0.10000 4.69537 3.62690 2.83403 2.25146 1.83195 1.54254 1.36156

p=0.1; pv=p; for(j in 1:20){ p = p - (4*p^3-3*p^4-1/2)/(12*p^2-12*p^3);
pv=c(pv,p) }; pv
# 0.10000 4.69537 3.62690 2.83403 2.25146 1.83195 1.54254 1.36156
# 1.27282 1.24913 1.24749 1.24748 1.24748 1.24748 1.24748 1.24748
# 1.24748 1.24748 1.24748 1.24748 1.24748

4*(0.614272)^3-3*(0.614272)^4 # 0.499999
pvec=seq(-0.5,1.4,0.005); Fvec = 4*pvec^3-3*pvec^4
Fvec[pvec<=0] = 0; Fvec[pvec>=1] = 1

X11(w=8,h=4.5); par(mfrow=c(1,1))

plot(pvec,Fvec,type="l",lwd=3,xlab="p",ylab="F(p|x)", main=" ")
abline(h=0.5,v=0.614272,lty=3); points(0.614272,0.5,pch=16, cex=1.2)
abline(h=c(0,1),lty=3); abline(v=c(0,1),lty=3)

gvecwrong=4*pvec^3-3*pvec^4-0.5

plot(pvec, gvecwrong,type="n",lwd=2,xlab="p",ylab="g(p) = F(p|x) - 1/2",
main=" ")
lines(pvec,Fvec-0.5,lwd=3)
lines(pvec[pvec<0], gvecwrong[pvec<0],lty=2,lwd=3)
lines(pvec[pvec>1], gvecwrong[pvec>1],lty=2,lwd=3)
abline(v=c(0.614272, 1.24748), lty=3); abline(h=0,lty=3)
points(c(0.614272, 1.24748),c(0,0),pch=16,cex=1.2)
abline(h=c(-0.5,0,0.5),lty=3); abline(v=c(0,1),lty=3)
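
Because the posterior in this exercise is a Beta(3, 2) distribution, the NR answer can be checked against R's built-in beta quantile function, and the spurious root can be examined with the true cdf. The following is a sketch added for illustration; the helper name `poly_g` is hypothetical.

```r
# Posterior is Beta(3, 2); its median is the 0.5 quantile
med <- qbeta(0.5, shape1 = 3, shape2 = 2)
med                    # compare with the NR result 0.614272
pbeta(med, 3, 2)       # posterior cdf evaluated at the median

# The polynomial 4p^3 - 3p^4 - 1/2 also vanishes near 1.24748,
# but the true cdf equals 1 for p > 1, so that root is spurious
poly_g <- function(p) 4 * p^3 - 3 * p^4 - 1/2
poly_g(1.24748)              # close to zero
pbeta(1.24748, 3, 2) - 1/2   # true g at the same point
```

Working with `pbeta()` rather than the raw polynomial is one way of encoding the ‘better definition’ of g, since `pbeta()` already returns 0 below the support and 1 above it.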


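4.3 The multivariate Newton-Raphson algorithm

The Newton-Raphson algorithm can also be used to solve several
equations simultaneously, say
$$g_k(x_1, \ldots, x_K) = 0, \quad k = 1, \ldots, K.$$

Let:
$$x = \begin{pmatrix} x_1 \\ \vdots \\ x_K \end{pmatrix}, \quad g(x) = \begin{pmatrix} g_1(x) \\ \vdots \\ g_K(x) \end{pmatrix}, \quad 0 = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix} \ \text{(a column vector of length } K\text{)}.$$

Then the system of K equations may be expressed as
$$g(x) = 0,$$
and the NR algorithm involves iterating according to
$$x^{(j+1)} = x^{(j)} - g'(x^{(j)})^{-1} g(x^{(j)}),$$
where:
$$x^{(j)} = \begin{pmatrix} x_1^{(j)} \\ \vdots \\ x_K^{(j)} \end{pmatrix} \ \text{is the value of } x \text{ at the } j\text{th iteration}$$
$$g(x^{(j)}) = \begin{pmatrix} g_1(x^{(j)}) \\ \vdots \\ g_K(x^{(j)}) \end{pmatrix} = g(x)\Big|_{x = x^{(j)}}$$
$$g'(x^{(j)}) = g'(x)\Big|_{x = x^{(j)}}, \quad g'(x) = \frac{\partial g(x)}{\partial x^T} = \begin{pmatrix} \partial g_1(x)/\partial x_1 & \cdots & \partial g_1(x)/\partial x_K \\ \vdots & \ddots & \vdots \\ \partial g_K(x)/\partial x_1 & \cdots & \partial g_K(x)/\partial x_K \end{pmatrix}.$$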
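
The multivariate iteration can be sketched generically in R as follows. This is an illustrative sketch only: the function name `newton_raphson_mv`, the stopping rule and the small test system ($x_1^2 + x_2^2 = 4$ and $x_1 = x_2$) are assumptions introduced here, not part of the original text.

```r
# Multivariate Newton-Raphson: solve g(x) = 0 for a vector x.
# g returns a numeric vector of length K; gprime returns the K x K matrix g'(x).
newton_raphson_mv <- function(g, gprime, x0, tol = 1e-10, max_iter = 100) {
  x <- x0
  for (j in seq_len(max_iter)) {
    step <- solve(gprime(x), g(x))   # solves g'(x) step = g(x), avoiding an explicit inverse
    x <- x - step
    if (max(abs(step)) < tol) return(list(root = x, iterations = j))
  }
  stop("no convergence within max_iter iterations")
}

# Hypothetical test system: x1^2 + x2^2 = 4 and x1 = x2
g  <- function(x) c(x[1]^2 + x[2]^2 - 4, x[1] - x[2])
gp <- function(x) matrix(c(2 * x[1], 2 * x[2],
                           1,        -1), nrow = 2, byrow = TRUE)

res <- newton_raphson_mv(g, gp, x0 = c(1, 2))
res$root
```

Calling `solve(A, b)` rather than `solve(A) %*% b` gives the same update but is usually the more numerically stable way to apply $g'(x)^{-1}$.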


Exercise 4.4 Finding a HPDR via the multivariate NR algorithm

Consider the Bayesian model:
$$(x \mid \lambda) \sim \text{Poisson}(\lambda)$$
$$f(\lambda) \propto 1, \quad \lambda > 0,$$
and suppose that we observe x = 1. Find the 80% HPDR for $\lambda$.

Solution to Exercise 4.4

First, $f(\lambda \mid x) \propto f(\lambda) f(x \mid \lambda) \propto 1 \times e^{-\lambda}\lambda^x / x! \propto e^{-\lambda}\lambda$, since x = 1.

Thus $(\lambda \mid x) \sim \text{Gamma}(2, 1)$, with $f(\lambda \mid x) = \lambda e^{-\lambda}$, $\lambda > 0$.

The 80% HPDR for $\lambda$ is (a, b), where a and b satisfy the two equations:
$$F(b \mid x) - F(a \mid x) = 0.8 \qquad (4.1)$$
$$f(b \mid x) = f(a \mid x). \qquad (4.2)$$


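Note: Here, $f(b \mid x)$ is the posterior pdf of $\lambda$ evaluated at b, $F(b \mid x)$
is the posterior cdf of $\lambda$ evaluated at b, etc. Equations (4.1) and (4.2)
reflect the requirement that $\lambda \in (a, b)$ with posterior probability 0.8,
and that the posterior density of $\lambda$ must be the same at both a and b,
considering that $\lambda$'s posterior pdf is bell-shaped and unimodal.

Thus we wish to solve the equation
$$g(t) = 0,$$
where:
$$0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad t = \begin{pmatrix} a \\ b \end{pmatrix}, \quad g(t) = \begin{pmatrix} g_1(t) \\ g_2(t) \end{pmatrix} = \begin{pmatrix} F(b \mid x) - F(a \mid x) - 0.8 \\ f(b \mid x) - f(a \mid x) \end{pmatrix}.$$
The Newton-Raphson algorithm for solving this equation is
$$t^{(j+1)} = t^{(j)} - g'(t^{(j)})^{-1} g(t^{(j)}),$$
where:
$$t^{(j)} = \begin{pmatrix} a_j \\ b_j \end{pmatrix}$$
$$g'(t) = \begin{pmatrix} \partial g_1(t)/\partial a & \partial g_1(t)/\partial b \\ \partial g_2(t)/\partial a & \partial g_2(t)/\partial b \end{pmatrix} = \begin{pmatrix} -a e^{-a} & b e^{-b} \\ e^{-a}(a - 1) & e^{-b}(1 - b) \end{pmatrix}.$$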

Starting at
$$t^{(0)} = \begin{pmatrix} a_0 \\ b_0 \end{pmatrix} = \begin{pmatrix} 0.5 \\ 3.0 \end{pmatrix}$$
(based on a visual inspection of the posterior density $f(\lambda \mid x) = \lambda e^{-\lambda}$), we
obtain results as shown in Table 4.3.

Table 4.3 Multivariate NR algorithm

$j$      0    1          2         3         4        5
$a_j$    0.5  0.0776524  0.163185  0.167317  0.16730  0.16730
$b_j$    3.0  2.7406883  3.025571  3.079274  3.08029  3.08029

It seems that the 80% HPDR for $\lambda$ is (0.16730, 3.08029). This interval is
illustrated in Figure 4.5 and appears to be correct.

As another check on our calculations, we find that:
$$f(\lambda = 3.08029 \mid x) - f(\lambda = 0.16730 \mid x) = 0.14153 - 0.14153 = 0$$
$$F(\lambda = 3.08029 \mid x) - F(\lambda = 0.16730 \mid x) = 0.81253 - 0.01253 = 0.8.$$


Figure 4.5 An 80% HPDR

[Figure: the posterior density f(lambda|x) plotted against lambda]


R Code for Exercise 4.4

gfun = function(a,b){
g1=pgamma(b,2,1)-pgamma(a,2,1)-0.8; g2=dgamma(b,2,1)-dgamma(a,2,1);
c(g1,g2) }

gpfun = function(a,b){ m11=-dgamma(a,2,1); m12=dgamma(b,2,1)
m21=exp(-a)*(a-1); m22=exp(-b)*(1-b)
matrix(c(m11,m12,m21,m22),nrow=2,byrow=T) }

gvec=c(0.5,3); gmat=gvec; for(j in 1:7){
a=gvec[1]; b=gvec[2]
gvec = gvec - solve(gpfun(a,b)) %*% gfun(a,b)
gmat = cbind(gmat,gvec) }

options(digits=6); gmat
# [1,] 0.5 0.0776524 0.163185 0.167317 0.16730 0.16730 0.16730 0.16730
# [2,] 3.0 2.7406883 3.025571 3.079274 3.08029 3.08029 3.08029 3.08029

lamv=seq(0,5,0.01); fv=dgamma(lamv,2,1)
X11(w=8,h=4.5); par(mfrow=c(1,1))
plot(lamv,fv,type="l",lwd=3,xlab="lambda",ylab="f(lambda|x)", main=" ")
abline(h=c(dgamma(a,2,1)),v=c(a,b), lty=1)

# Checks:
c(a,b,dgamma(c(a,b),2,1)) # 0.167300 3.080291 0.141527 0.141527
c(pgamma(a,2,1), pgamma(b,2,1), pgamma(b,2,1) - pgamma(a,2,1))
# 0.0125275 0.8125275 0.8000000
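
A different route to the same interval, sketched here as a cross-check rather than as the method of the exercise, uses the fact that for a unimodal posterior the HPDR is the shortest interval with the required probability content. The helper name `width` and the search range (0, 0.19) for the lower tail probability are assumptions made for this sketch.

```r
# Width of the interval with posterior probability 0.8 whose lower tail area is p
width <- function(p) {
  qgamma(p + 0.8, shape = 2, rate = 1) - qgamma(p, shape = 2, rate = 1)
}

# Minimise the width over the lower tail probability p
opt <- optimize(width, interval = c(0, 0.19), tol = 1e-10)

a <- qgamma(opt$minimum, shape = 2, rate = 1)
b <- qgamma(opt$minimum + 0.8, shape = 2, rate = 1)
c(a, b)                                 # compare with (0.16730, 3.08029)
dgamma(c(a, b), shape = 2, rate = 1)    # posterior density at the two endpoints
```

This reduces the two-equation problem to a one-dimensional minimisation, at the cost of relying on the quantile function `qgamma()` being available.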


4.4 The Expectation-Maximisation (EM) algorithm

We have shown how the Newton-Raphson algorithm for solving $g(x) = 0$
numerically can be useful for finding the posterior median and the HPDR.
That algorithm can also be used for finding the posterior mode, when this
is the solution of
$$\frac{\partial f(\theta \mid y)}{\partial \theta} = 0$$
or equivalently
$$\frac{\partial \log f(\theta \mid y)}{\partial \theta} = 0.$$

In some situations, finding the posterior mode either analytically or via
the NR algorithm may be problematic because the posterior density
$f(\theta \mid y)$ has a very complicated form. In that case, one may consider
applying the Expectation-Maximisation (EM) algorithm.

This algorithm first requires the specification (i.e. definition by the user)
of some suitable latent data, which we will denote by z, and then the
application of the following two steps iteratively until convergence.


Note: The choice of the latent data z will depend on the particular 
application. 


Step 1. The Expectation Step (E-Step)

Determine the Q-function, defined as
$$Q_j(\theta) = E_z\{\log f(\theta \mid y, z) \mid y, \theta_j\} = \int \log f(\theta \mid y, z)\, f(z \mid y, \theta_j)\, dz, \qquad (4.3)$$
or, in words, as
the expectation of the log-augmented posterior density with respect
to the distribution of the latent data given the observed data and
current parameter estimates.


Step 2. The Maximisation Step (M-Step) 


Find the value of $\theta$ which maximises the Q-function, for example using
the Newton-Raphson algorithm.

This value becomes the current parameter estimate in the next iteration.

Note 1: For mathematical convenience, the Q-function may also be
defined as at (4.3) but with the addition of, and/or multiplication by,
any constants which do not depend on the parameter $\theta$. This extended
definition allows us to ignore terms which have no impact on the final
results. If (4.3) is multiplied by a negative constant, the resulting
Q-function should be minimised at Step 2 rather than maximised.


Note 2: If there is a choice between using the NR algorithm or the EM
algorithm, one should consider the fact that the EM algorithm is slower
to converge but far more stable. In fact, each iteration of the EM
algorithm is guaranteed not to decrease the posterior density, and under
certain regularity conditions the algorithm converges to a (local)
posterior mode. By contrast, the NR algorithm may
not converge at all if started at a value far away from the required
solution. Thus, one plausible strategy is to use the EM algorithm to
obtain an approximate solution which is sufficiently close to the correct
answer, and then to obtain a very high precision using just a few
iterations of the NR algorithm.


Exercise 4.5 Illustration of the EM algorithm 


Consider the Bayesian model given by:
$$(y_1, \ldots, y_n \mid \lambda) \sim \text{iid Gamma}(1, \lambda)$$
$$f(\lambda) \propto 1, \quad \lambda > 0.$$

Suppose that the data, denoted D, consists of the observed data vector,
denoted by
$$y_o = (y_1, \ldots, y_k),$$
and the partially observed (or missing) data vector, denoted by
$$y_m = (y_{k+1}, \ldots, y_n).$$

We don't know the values in $y_m$ exactly, only that each of those values
is greater than some specified constant c.

Suppose that c = 10, n = 5, k = 3 and $y_o$ = (3.1, 8.2, 6.9).

(a) Find the posterior mode of $\lambda$ by maximising the posterior density
directly.

(b) Find the posterior mode of $\lambda$ using the EM algorithm.


Solution to Exercise 4.5 
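(a) First,
$$f(\lambda \mid D) \propto f(\lambda) f(D \mid \lambda) \propto 1 \times \prod_{i=1}^k f(y_i \mid \lambda) \times \prod_{i=k+1}^n P(y_i > c \mid \lambda),$$
where:
$$f(y_i \mid \lambda) = \lambda e^{-\lambda y_i}$$
$$P(y_i > c \mid \lambda) = \int_c^\infty \lambda e^{-\lambda y_i}\, dy_i = e^{-\lambda c}.$$

Then
$$f(\lambda \mid D) \propto \left(\prod_{i=1}^k \lambda e^{-\lambda y_i}\right)\left(\prod_{i=k+1}^n e^{-\lambda c}\right) = \lambda^k \exp\{-\lambda[y_{oT} + (n-k)c]\},$$
where $y_{oT} = y_1 + \cdots + y_k = 18.2$ (the total of the observed values).

So
$$l(\lambda) \equiv \log f(\lambda \mid D) = k\log\lambda - \lambda[y_{oT} + (n-k)c]$$
$$\Rightarrow \ l'(\lambda) = \frac{k}{\lambda} - [y_{oT} + (n-k)c].$$

Setting $l'(\lambda)$ to zero yields the posterior mode,
$$\hat{\lambda} = \frac{k}{y_{oT} + (n-k)c} = 0.078534.$$

(b) The latent data here may be defined as $z = y_m = (y_{k+1}, \ldots, y_n)$.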


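Then, the augmented posterior density is
$$f(\lambda \mid y_o, y_m) \propto 1 \times \prod_{i=1}^n \lambda e^{-\lambda y_i} = \lambda^n \exp\{-\lambda[y_{oT} + y_{mT}]\},$$
where $y_{mT} = y_{k+1} + \cdots + y_n$ (the total of the missing values).

So the log-augmented density is
$$\log f(\lambda \mid y_o, y_m) = n\log\lambda - \lambda[y_{oT} + y_{mT}] + c_1$$
(where $c_1$ is a constant with respect to $\lambda$).

Now,
$$f(y_i \mid y_i > c, \lambda) = \frac{\lambda e^{-\lambda y_i}}{e^{-\lambda c}} = \lambda e^{-\lambda(y_i - c)}, \quad y_i > c$$
(an exponential pdf shifted to the right by c).

Therefore, $E(y_i \mid y_i > c, \lambda) = c + 1/\lambda$.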
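
The conditional expectation $c + 1/\lambda$ can be confirmed numerically by integrating against the shifted exponential density. This is a sketch for illustration only; the values `lam = 0.5` and `c0 = 10` are arbitrary assumptions, with `c0` standing in for the constant c.

```r
lam <- 0.5   # illustrative value of lambda
c0  <- 10    # plays the role of the censoring constant c

# Density of y given y > c0 is lam * exp(-lam * (y - c0)) on (c0, Inf)
cond_mean <- integrate(function(y) y * lam * exp(-lam * (y - c0)),
                       lower = c0, upper = Inf)$value

c(numerical = cond_mean, formula = c0 + 1 / lam)
```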


It follows that the Q-function is given by
$$Q_j(\lambda) = n\log\lambda - \lambda\left[y_{oT} + (n-k)\left(c + \frac{1}{\lambda_j}\right)\right]$$
(note the distinction here between $\lambda$ and $\lambda_j$).

That concludes the E-Step.

As regards the M-Step, we now calculate the derivative
$$Q_j'(\lambda) = \frac{n}{\lambda} - \left[y_{oT} + (n-k)\left(c + \frac{1}{\lambda_j}\right)\right].$$
Setting this derivative to zero yields a formula for the next value,
$$\lambda_{j+1} = \frac{n}{y_{oT} + (n-k)(c + 1/\lambda_j)}. \qquad (4.4)$$

Implementing the above EM algorithm starting at $\lambda_0 = 1$ we get the
following sequence:
1.000000, 0.124378, 0.092115, 0.083456, 0.080431, 0.079282,
0.078832, 0.078653, 0.078581, 0.078553, 0.078542, 0.078537,
0.078535, 0.078535, 0.078534, 0.078534, .....

We see that the EM algorithm has converged correctly to the answer
obtained in (a), namely 0.078534.


Note: Writing (4.4) with $\lambda_j = \lambda_{j+1} = \lambda$ (i.e. the limiting value) gives
$$\lambda = \frac{n}{y_{oT} + (n-k)(c + 1/\lambda)},$$
and this can be solved easily for the same formula as derived in (a),
namely
$$\lambda = \frac{k}{y_{oT} + (n-k)c}.$$

Thus, in this exercise it was not necessary to actually perform any
iterations of the EM algorithm.


R Code for Exercise 4.5 


# (a) 
n=5; k=3; c=10; yo=c(3.1, 8.2, 6.9); yoT=sum(yo); yoT # 18.2 
k/(yoT+(n-k)*c) # 0.078534 


# (b) 

lam = 1; lamv = lam; options(digits=5) 

for(j in 1:20){ lam=n/(yoT+(n-k)*(c+1/lam)); lamv=c(lamv,lam) } 

lamv 

# 1.000000 0.124378 0.092115 0.083456 0.080431 0.079282 0.078832 
# 0.078653 0.078581 0.078553 0.078542 0.078537 0.078535 0.078535 
# 0.078534 0.078534 0.078534 0.078534 0.078534 0.078534 0.078534 
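
The strategy suggested in Note 2 of Section 4.4 (a few EM iterations followed by a few NR iterations) can be sketched for this exercise as follows. This is an illustrative sketch: the helper names `em_step` and `nr_step`, the choice of three EM and four NR iterations, and the name `c0` for the constant c are assumptions made here. The NR step is applied to $l'(\lambda) = k/\lambda - [y_{oT} + (n-k)c]$ from part (a), with $l''(\lambda) = -k/\lambda^2$.

```r
n <- 5; k <- 3; c0 <- 10; yoT <- sum(c(3.1, 8.2, 6.9))
S <- yoT + (n - k) * c0                 # y_oT + (n - k)c

em_step <- function(lam) n / (yoT + (n - k) * (c0 + 1 / lam))   # equation (4.4)
nr_step <- function(lam) lam - (k / lam - S) / (-k / lam^2)      # NR applied to l'(lambda) = 0

lam <- 1
for (j in 1:3) lam <- em_step(lam)      # a few stable EM iterations
lam_em <- lam
for (j in 1:4) lam <- nr_step(lam)      # then a few fast NR iterations

c(after_em = lam_em, after_nr = lam, exact = k / S)

nr_step(1)   # NR started directly at 1 jumps to a negative (impossible) value
```

The last line shows why the EM warm-up matters here: a single NR step taken from the same starting value as the EM run leaves the parameter space.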


Exercise 4.6 EM algorithm for right-censored Gaussian data 


Consider the Bayesian model given by:
$$(y_1, \ldots, y_n \mid \mu) \sim \text{iid } N(\mu, \sigma^2)$$
$$f(\mu) \propto 1, \quad \mu \in \mathbb{R}.$$

Suppose that the data, denoted D, consists of the observed data vector
$$y_o = (y_1, \ldots, y_k)$$
and the partially observed (or ‘missing’) data vector
$$y_m = (y_{k+1}, \ldots, y_n).$$

We don’t know the values in $y_m$ exactly, but only that each of these
values is greater than some specified constant c.

Suppose that c = 10, n = 5, k = 3 and $y_o$ = (3.1, 8.2, 6.9).

(a) Find the log-posterior density of $\mu$ and describe how it could be used
to find the posterior mode of $\mu$. (Do not actually find that mode in this
way.)

(b) Find the posterior mode of $\mu$ using the EM algorithm. Then check
your answer by showing the mode in plots of the likelihood and
log-likelihood functions.


Solution to Exercise 4.6 


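(a) Observe that
$$f(\mu \mid D) \propto f(\mu) \prod_{i=1}^k f(y_i \mid \mu) \prod_{i=k+1}^n P(y_i > c \mid \mu).$$

Here,
$$\prod_{i=1}^k f(y_i \mid \mu) \propto \prod_{i=1}^k \exp\left\{-\frac{1}{2\sigma^2}(y_i - \mu)^2\right\} = \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^k (y_i - \mu)^2\right\}$$
$$= \exp\left\{-\frac{1}{2\sigma^2}\left[(k-1)s_o^2 + k(\mu - \bar{y}_o)^2\right]\right\},$$
where:
$$\bar{y}_o = \frac{1}{k}\sum_{i=1}^k y_i \quad \text{(the observed sample mean)}$$
$$s_o^2 = \frac{1}{k-1}\sum_{i=1}^k (y_i - \bar{y}_o)^2 \quad \text{(the observed sample variance).}$$

Also,
$$P(y_i > c \mid \mu) = \int_c^\infty \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2\sigma^2}(y_i - \mu)^2\right\} dy_i = P\left(Z > \frac{c - \mu}{\sigma}\right) = 1 - \Phi\left(\frac{c - \mu}{\sigma}\right),$$
where $Z \sim N(0, 1)$ and $\Phi(z) = P(Z \le z)$ (the standard normal cdf).

Therefore
$$f(\mu \mid D) \propto \exp\left\{-\frac{k}{2\sigma^2}(\mu - \bar{y}_o)^2\right\}\left\{1 - \Phi\left(\frac{c - \mu}{\sigma}\right)\right\}^{n-k}.$$

So the log-posterior is
$$l(\mu) \equiv \log f(\mu \mid D) = -\frac{k}{2\sigma^2}(\mu - \bar{y}_o)^2 + (n-k)\log\left(1 - \Phi\left(\frac{c - \mu}{\sigma}\right)\right) + c_1$$
(where $c_1$ is a term which does not depend on $\mu$).

To find the posterior mode of $\mu$ we could solve the equation
$$l'(\mu) = 0,$$
where
$$l'(\mu) = \frac{\partial \log f(\mu \mid D)}{\partial \mu} = -\frac{k}{\sigma^2}(\mu - \bar{y}_o) + \frac{n-k}{\sigma} \cdot \frac{\phi\left(\frac{c - \mu}{\sigma}\right)}{1 - \Phi\left(\frac{c - \mu}{\sigma}\right)}$$
and $\phi$ denotes the standard normal pdf.

This solution could be obtained via the NR algorithm defined by
$$\mu_{j+1} = \mu_j - \frac{l'(\mu_j)}{l''(\mu_j)},$$
where
$$l''(\mu) = \frac{\partial l'(\mu)}{\partial \mu} = \ldots$$

As a further exercise, one could complete the formula for $l''(\mu)$ above
and actually implement the NR algorithm.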


Note: The posterior mode here is also the maximum likelihood estimate, 
since the prior is proportional to a constant. 
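
One possible completion of that further exercise is sketched below. It is not taken from the original text: the second derivative is derived here from the hazard function $h(z) = \phi(z)/(1 - \Phi(z))$, which satisfies $h'(z) = h(z)\{h(z) - z\}$, giving $l''(\mu) = -k/\sigma^2 - \{(n-k)/\sigma^2\}\, h(z)\{h(z) - z\}$ with $z = (c - \mu)/\sigma$. The value `sig = 3` is the one used in the R code for this exercise, and the names `haz`, `l1`, `l2` and `c0` are hypothetical.

```r
yo <- c(3.1, 8.2, 6.9); n <- 5; k <- 3; c0 <- 10
sig <- 3                       # sigma, as in the R code for this exercise
ybar <- mean(yo)

# Hazard function of the standard normal: phi(z) / (1 - Phi(z))
haz <- function(z) dnorm(z) / pnorm(z, lower.tail = FALSE)

l1 <- function(mu) {           # first derivative l'(mu)
  z <- (c0 - mu) / sig
  -(k / sig^2) * (mu - ybar) + ((n - k) / sig) * haz(z)
}
l2 <- function(mu) {           # second derivative l''(mu), derived for this sketch
  z <- (c0 - mu) / sig
  -k / sig^2 - ((n - k) / sig^2) * haz(z) * (haz(z) - z)
}

mu <- 5; muv <- mu             # start at 5, the same start as the EM run in part (b)
for (j in 1:10) { mu <- mu - l1(mu) / l2(mu); muv <- c(muv, mu) }
muv
```

The final value can be compared with the posterior mode reported in part (b), 8.3985.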


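(b) With $y_m = (y_{k+1}, \ldots, y_n)$ as the latent data, the augmented posterior is
$$f(\mu \mid y_o, y_m) \propto f(\mu) \prod_{i=1}^k f(y_i \mid \mu) \prod_{i=k+1}^n f(y_i \mid \mu) \propto \exp\left\{-\frac{1}{2\sigma^2}\left[\sum_{i=1}^k (y_i - \mu)^2 + \sum_{i=k+1}^n (y_i - \mu)^2\right]\right\}.$$

So the log-augmented posterior is
$$\log f(\mu \mid y_o, y_m) = -\frac{1}{2\sigma^2}\left[\sum_{i=1}^k (y_i - \mu)^2 + \sum_{i=k+1}^n (y_i - \mu)^2\right] + \text{constant}$$
$$= -\frac{1}{2\sigma^2}\left[k\mu^2 - 2\mu k\bar{y}_o + (n-k)\mu^2 - 2\mu(n-k)\bar{y}_m\right] + c_2,$$
where:
$$\bar{y}_o = \frac{1}{k}\sum_{i=1}^k y_i \quad \text{(the sample mean of the observed values)}$$
$$\bar{y}_m = \frac{1}{n-k}\sum_{i=k+1}^n y_i \quad \text{(the sample mean of the missing values)}$$
and $c_2$ is a term which does not depend on $\mu$.

Thus the Q-function may be taken as
$$Q_j(\mu) = k\mu^2 - 2\mu k\bar{y}_o + (n-k)\mu^2 - 2\mu(n-k)e_j = n\mu^2 - 2\mu\{k\bar{y}_o + (n-k)e_j\},$$
where $e_j = E(\bar{y}_m \mid D, \mu_j) = E(y_i \mid D, \mu_j)$ $(i > k)$. This version of the
Q-function is the expected log-augmented posterior multiplied by the
negative constant $-2\sigma^2$ (ignoring additive constants), so it is to be
minimised at the M-Step.

We see that $e_j = E(X \mid X > c)$ evaluated at $\mu = \mu_j$,
where $X \sim N(\mu, \sigma^2)$ (with $\mu$ taken as a constant).

Now observe that
$$E(X \mid X > c) = \int_c^\infty x \frac{f(x)}{P(X > c)}\, dx = \frac{I}{P(X > c)},$$
where
$$P(X > c) = 1 - P(X \le c) = 1 - P\left(Z \le \frac{c - \mu}{\sigma}\right) = 1 - \Phi\left(\frac{c - \mu}{\sigma}\right),$$
and where
$$I = \int_c^\infty x \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(x - \mu)^2}\, dx = \int_c^\infty (x - \mu) \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(x - \mu)^2}\, dx + \mu P(X > c)$$
$$= \int_{(c-\mu)^2/2}^\infty \frac{1}{\sigma\sqrt{2\pi}} e^{-t/\sigma^2}\, dt + \mu P(X > c) \quad \text{where } t = \frac{(x - \mu)^2}{2} \text{ and } dt = (x - \mu)dx$$
$$= \frac{\sigma}{\sqrt{2\pi}}\left[-e^{-t/\sigma^2}\right]_{(c-\mu)^2/2}^{\infty} + \mu P(X > c) = \frac{\sigma}{\sqrt{2\pi}} e^{-\frac{(c-\mu)^2}{2\sigma^2}} + \mu P(X > c)$$
$$= \sigma\phi\left(\frac{c - \mu}{\sigma}\right) + \mu P(X > c),$$
where $\phi(z)$ is the standard normal pdf.

Thus
$$E(X \mid X > c) = \frac{\sigma\phi\left(\frac{c - \mu}{\sigma}\right) + \mu\left[1 - \Phi\left(\frac{c - \mu}{\sigma}\right)\right]}{1 - \Phi\left(\frac{c - \mu}{\sigma}\right)} = \mu + \sigma\frac{\phi\left(\frac{c - \mu}{\sigma}\right)}{1 - \Phi\left(\frac{c - \mu}{\sigma}\right)},$$
and consequently
$$e_j = \mu_j + \sigma\frac{\phi\left(\frac{c - \mu_j}{\sigma}\right)}{1 - \Phi\left(\frac{c - \mu_j}{\sigma}\right)}.$$

That completes the E-Step, which may be summarised by writing
$$Q_j(\mu) = n\mu^2 - 2\mu(k\bar{y}_o + (n-k)e_j),$$
where $e_j$ is as given above.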
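
The closed-form expression for $E(X \mid X > c)$ can be checked against direct numerical integration. The sketch below is illustrative only; the values `mu = 5`, `sig = 3` and `c0 = 10` are assumptions chosen to match the first EM iteration of this exercise, with `c0` standing in for c.

```r
mu <- 5; sig <- 3; c0 <- 10    # illustrative values

# Closed form: mu + sigma * phi(z) / (1 - Phi(z)), with z = (c0 - mu) / sigma
z <- (c0 - mu) / sig
formula_val <- mu + sig * dnorm(z) / (1 - pnorm(z))

# Direct calculation: integral of x f(x) over (c0, Inf), divided by P(X > c0)
numer <- integrate(function(x) x * dnorm(x, mean = mu, sd = sig),
                   lower = c0, upper = Inf)$value
numeric_val <- numer / pnorm(c0, mean = mu, sd = sig, lower.tail = FALSE)

c(formula = formula_val, numerical = numeric_val)
```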


The M-Step then involves calculating
$$Q_j'(\mu) = 2n\mu - 2\{k\bar{y}_o + (n-k)e_j\}$$
and setting this to zero so as to yield the next parameter estimate,
$$\mu_{j+1} = \frac{k\bar{y}_o + (n-k)e_j}{n}.$$

Implementing the above EM algorithm starting at $\mu_0 = 5$ (arbitrarily), we
obtain the sequence:
5.000000, 8.137838, 8.371786, 8.395701, 8.398209, 8.398473,
8.398501, 8.398504, 8.398504, 8.398504, 8.398504, .....

We conclude that the posterior mode of $\mu$ is 8.3985.


Figure 4.6 shows the posterior density (top subplot) and the log-posterior
density (bottom subplot). Each of these density functions is drawn scaled,
meaning correct only up to a constant of proportionality. In each subplot,
the posterior mode is indicated by way of a vertical dashed line.

Figure 4.6 Posterior and log-posterior densities (scaled)

R Code for Exercise 4.6


# (b)
options(digits=6); yo = c(3.1, 8.2, 6.9); n=5; k = 3; c= 10; sig=3;
yoT=sum(yo); c(yoT, yoT/3) # 18.20000 6.06667
mu=5; muv=mu; for(j in 1:10){
ej = mu + sig * dnorm((c-mu)/sig) / ( 1-pnorm((c-mu)/sig) )
mu = ( yoT + (n-k)*ej )/n
muv=c(muv,mu) }
muv # 5.00000 8.13784 8.37179 8.39570 8.39821 8.39847
# 8.39850 8.39850 8.39850 8.39850 8.39850
modeval=muv[length(muv)]; modeval # 8.3985

muvec=seq(0,20,0.001); lvec=muvec
for(i in 1:length(muvec)){ muval=muvec[i]
lvec[i]=(-1/(2*sig^2))*sum((yo-muval)^2) +
(n-k)*log(1-pnorm((c-muval)/sig)) }
iopt=(1:length(muvec))[lvec==max(lvec)]; muopt=muvec[iopt]; muopt # 8.399

X11(w=8,h=6); par(mfrow=c(2,1));
plot(muvec,exp(lvec),type="l",lwd=2); abline(v=modeval, lty=2,lwd=2)
plot(muvec,lvec,type="l",lwd=2); abline(v=modeval,lty=2,lwd=2)

4.5 Variants of the NR and EM algorithms


The Newton-Raphson and Expectation-Maximisation algorithms can be 
modified and combined in various ways to produce a number of useful 
variants or ‘hybrids’. For example, the NR algorithm can be used at each 
M-Step of the EM algorithm to maximise the Q-function. 


If the EM algorithm is applied to find the mode of a parameter vector, say
$\theta = (\theta_1, \theta_2)$, then the multivariate NR algorithm for doing this may be
problematic and one may consider using the ECM algorithm (where C
stands for Conditional).

The idea is, at each M-Step, to maximise the Q-function with respect to
$\theta_1$, with $\theta_2$ fixed at its current value; and then to maximise the Q-function
with respect to $\theta_2$, with $\theta_1$ fixed at its current value.


If each of these conditional maximisations is achieved via the NR 
algorithm, the procedure can be modified to become the ECM1 algorithm. 
This involves applying only one step of each NR algorithm (rather than 
finding the exact conditional maximum). In many cases the ECM1 
algorithm will be more efficient at finding the posterior mode than the 
ECM algorithm. 


Sometimes, when the simultaneous solution of several equations via the 
multivariate NR algorithm is problematic, a more feasible solution is to 
apply a suitable CNR algorithm (where again C stands for Conditional). 


For example, suppose we wish to solve two equations simultaneously, say:
$$g_1(a, b) = 0$$
$$g_2(a, b) = 0,$$
for a and b. Then it may be convenient to define the function
$$g(a, b) = g_1(a, b)^2 + g_2(a, b)^2,$$
which clearly has a minimum value of zero at the required solutions for a
and b.

This suggests that we iterate two steps as follows:
Step 1. Minimise g(a,b) with respect to a, with b held fixed.

Step 2. Minimise g(a,b) with respect to b, with a held fixed.

The first of these two steps involves solving
$$\frac{\partial g(a, b)}{\partial a} = 0,$$
where
$$\frac{\partial g(a, b)}{\partial a} = 2g_1(a, b)\frac{\partial g_1(a, b)}{\partial a} + 2g_2(a, b)\frac{\partial g_2(a, b)}{\partial a}.$$

Assuming the current values of a and b are $a_j$ and $b_j$, this can be achieved
via the NR algorithm by setting $a_0' = a_j$ and iterating until convergence
as follows (k = 0, 1, 2, ...):
$$a_{k+1}' = a_k' - \frac{\left.\dfrac{\partial g(a, b)}{\partial a}\right|_{a = a_k',\, b = b_j}}{\left.\dfrac{\partial^2 g(a, b)}{\partial a^2}\right|_{a = a_k',\, b = b_j}},$$
and finally setting
$$a_{j+1} = a_\infty'. \qquad (4.5)$$

The second of the two steps involves solving
$$\frac{\partial g(a, b)}{\partial b} = 0,$$
where
$$\frac{\partial g(a, b)}{\partial b} = 2g_1(a, b)\frac{\partial g_1(a, b)}{\partial b} + 2g_2(a, b)\frac{\partial g_2(a, b)}{\partial b}.$$

This can be achieved via the NR algorithm by setting $b_0' = b_j$ and iterating
until convergence as follows (k = 0, 1, 2, ...):
$$b_{k+1}' = b_k' - \frac{\left.\dfrac{\partial g(a, b)}{\partial b}\right|_{a = a_{j+1},\, b = b_k'}}{\left.\dfrac{\partial^2 g(a, b)}{\partial b^2}\right|_{a = a_{j+1},\, b = b_k'}},$$
and finally setting
$$b_{j+1} = b_\infty'. \qquad (4.6)$$

A variant of the CNR algorithm is the CNR1 algorithm. This involves
performing only one step of each NR algorithm in the CNR algorithm.

In the above example, the CNR1 algorithm implies we set $a_{j+1} = a_1'$ at
(4.5) and $b_{j+1} = b_1'$ at (4.6) (rather than $a_{j+1} = a_\infty'$ and $b_{j+1} = b_\infty'$).

This modification will also result in eventual convergence to the solution
of $g_1(a, b) = 0$ and $g_2(a, b) = 0$.


One application of the CNR and CNR1 algorithms is to finding the HPDR 
for a parameter. 


For example, in Exercise 4.4 we considered the model given by
$$(x \mid \lambda) \sim \text{Poisson}(\lambda)$$
$$f(\lambda) \propto 1, \quad \lambda > 0,$$
with observed data x = 1.

The 80% HPDR for $\lambda$ was shown to be (a, b), where a and b are the
simultaneous solutions of the two equations:
$$g_1(a, b) = F(b \mid x) - F(a \mid x) - 0.8 = 0$$
$$g_2(a, b) = f(b \mid x) - f(a \mid x) = 0.$$


Applying the CNR or CNR1 algorithm as described above should also 
lead to the same interval as obtained earlier via the multivariate NR 
algorithm, namely (0.16730, 3.08029). 
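
The two-step conditional scheme can be sketched for the HPDR problem as follows. Note the substitution made in this sketch: each conditional minimisation is carried out with base R's `optimize()` rather than with inner NR iterations, so this is a conditional-minimisation analogue of CNR and not the CNR algorithm itself. The search intervals (0, 1) for a and (1, 10) for b, which lie on either side of the posterior mode at 1, and the choice of 50 sweeps are assumptions.

```r
g1 <- function(a, b) pgamma(b, 2, 1) - pgamma(a, 2, 1) - 0.8
g2 <- function(a, b) dgamma(b, 2, 1) - dgamma(a, 2, 1)
g  <- function(a, b) g1(a, b)^2 + g2(a, b)^2   # zero exactly at the required solution

a <- 0.5; b <- 3                                # same starting values as in Exercise 4.4
for (j in 1:50) {
  # Step 1: minimise g over a (below the mode), with b held fixed
  a <- optimize(function(a_) g(a_, b), interval = c(0, 1), tol = 1e-12)$minimum
  # Step 2: minimise g over b (above the mode), with a held fixed
  b <- optimize(function(b_) g(a, b_), interval = c(1, 10), tol = 1e-12)$minimum
}
c(a = a, b = b, g = g(a, b))
```

Printing the final value of g alongside a and b gives a direct convergence check, since g equals zero only when both equations are satisfied.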


For further details regarding the EM algorithm, the Newton-Raphson
algorithm, and extensions thereof, see McLachlan and Krishnan (2008).


Exercise 4.7 Application of the EM and ECM algorithms to a 
normal mixture model 


Consider the following Bayesian model:
$$(y_i \mid R, \mu, \delta) \overset{\perp}{\sim} N(\mu + \delta R_i, \sigma^2), \quad i = 1, \ldots, n$$
$$(R_1, \ldots, R_n \mid \mu, \delta) \sim \text{iid Bernoulli}(\pi)$$
$$f(\mu, \delta) \propto 1, \quad \mu \in \mathbb{R},\ \delta > 0.$$

This model says that each value $y_i$ has a common variance $\sigma^2$ and one
of two means, these being:
$\mu$ if $R_i = 0$
$\mu + \delta$ if $R_i = 1$.

Each of the ‘latent’ indicator variables $R_i$ has known probability $\pi$ of
being equal to 1, and probability $1 - \pi$ of being equal to 0.

Note: In more advanced models, the quantity $\pi$ could be treated as
unknown and assigned a prior distribution, along with the other two
model parameters, $\mu$ and $\delta$. The model here provides a ‘stepping
stone’ to understanding and implementing such more complex models.

(a) Consider the situation where n = 100, $\pi$ = 1/3, $\mu$ = 20, $\delta$ = 10 and
$\sigma$ = 3. Generate a data vector $y = (y_1, \ldots, y_n)$ using these specifications
and create a histogram of the simulated values.

(b) Design an EM algorithm for finding the posterior mode of $\theta = (\mu, \delta)$.
Then implement the algorithm so as to find that mode.


(c) Modify the EM algorithm in part (b) so that it is an ECM algorithm. 
Then run the ECM algorithm so as to check your answer to part (b). 


(d) Create a plot which shows the routes taken by the algorithms in parts 
(b) and (c). 


Solution to Exercise 4.7 


(a) Figure 4.7 shows a histogram of the sampled values which clearly
shows the two component normal densities and the mixture density. The
sample mean of the data is 23.16. Also, 29 of the 100 $R_i$ values are equal
to 1, and 71 of them are equal to 0.

Figure 4.7 Histogram of simulated data

(b) We will here take the vector $R = (R_1, \ldots, R_n)$ as the latent data. The
conditional posterior of $\mu$ and $\delta$ given this latent data is


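$$f(\mu, \delta \mid y, R) \propto f(\mu, \delta, y, R) = f(\mu, \delta) f(R \mid \mu, \delta) f(y \mid R, \mu, \delta)$$
$$\propto 1 \times \prod_{i=1}^n \pi^{R_i}(1 - \pi)^{1 - R_i} \times \prod_{i=1}^n \exp\left\{-\frac{1}{2\sigma^2}(y_i - [\mu + R_i\delta])^2\right\}$$
$$\propto 1 \times 1 \times \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - [\mu + R_i\delta])^2\right\}.$$

So the log-augmented posterior density is
$$\log f(\mu, \delta \mid y, R) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - [\mu + R_i\delta])^2$$
$$= -\frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i^2 - 2y_i[\mu + R_i\delta] + [\mu + R_i\delta]^2\right)$$
$$= -\frac{1}{2\sigma^2}\left\{\sum_{i=1}^n y_i^2 - 2\mu\sum_{i=1}^n y_i - 2\delta\sum_{i=1}^n R_i y_i + \sum_{i=1}^n \left(\mu^2 + 2\mu\delta R_i + \delta^2 R_i^2\right)\right\}$$
$$= -c_1\left\{c_2 - 2\mu n\bar{y} - 2\delta\sum_{i=1}^n y_i R_i + n\mu^2 + 2\mu\delta\sum_{i=1}^n R_i + \delta^2\sum_{i=1}^n R_i^2\right\},$$
where $c_1$ and $c_2$ are positive constants which do not depend on $\mu$ or $\delta$
in any way. We see that
$$\log f(\mu, \delta \mid y, R) = -c_1\left\{c_2 - 2\mu n\bar{y} - 2\delta\sum_{i=1}^n y_i R_i + n\mu^2 + 2\mu\delta R_T + \delta^2 R_T\right\},$$
where $R_T = \sum_{i=1}^n R_i$.

Note: Each $R_i$ equals 0 or 1, and therefore $R_i^2 = R_i$.

So the Q-function is
$$Q_j(\mu, \delta) = E_R\{\log f(\mu, \delta \mid y, R) \mid y, \mu_j, \delta_j\}$$
$$= -c_1\left\{c_2 - 2\mu n\bar{y} - 2\delta\sum_{i=1}^n y_i e_{ij} + n\mu^2 + 2\mu\delta e_{Tj} + \delta^2 e_{Tj}\right\},$$
where:
$$e_{ij} = E(R_i \mid y, \mu_j, \delta_j)$$
$$e_{Tj} = E(R_T \mid y, \mu_j, \delta_j) = \sum_{i=1}^n e_{ij}.$$

We now need to obtain formulae for the $e_{ij}$ values. Observe that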
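$$f(R \mid y, \mu, \delta) \propto f(\mu, \delta, y, R) \propto 1 \times \prod_{i=1}^n \pi^{R_i}(1 - \pi)^{1 - R_i} \times \prod_{i=1}^n \exp\left\{-\frac{1}{2\sigma^2}(y_i - [\mu + R_i\delta])^2\right\}.$$

It follows that
$$(R_i \mid y, \mu, \delta) \overset{\perp}{\sim} \text{Bernoulli}(e_i), \quad i = 1, \ldots, n,$$
where
$$e_i = \frac{\pi\exp\left\{-\frac{1}{2\sigma^2}(y_i - [\mu + \delta])^2\right\}}{\pi\exp\left\{-\frac{1}{2\sigma^2}(y_i - [\mu + \delta])^2\right\} + (1 - \pi)\exp\left\{-\frac{1}{2\sigma^2}(y_i - \mu)^2\right\}}.$$

Therefore
$$e_{ij} = \frac{\pi\exp\left\{-\frac{1}{2\sigma^2}(y_i - [\mu_j + \delta_j])^2\right\}}{\pi\exp\left\{-\frac{1}{2\sigma^2}(y_i - [\mu_j + \delta_j])^2\right\} + (1 - \pi)\exp\left\{-\frac{1}{2\sigma^2}(y_i - \mu_j)^2\right\}}.$$

Thereby the E-Step of the EM algorithm has been defined.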


Next, the M-Step requires us to maximise the Q-function. We begin by
writing:
$$\frac{\partial Q_j(\mu, \delta)}{\partial \mu} = -c_1\left\{0 - 2n\bar{y} - 0 + 2n\mu + 2\delta e_{Tj} + 0\right\}$$
$$\frac{\partial Q_j(\mu, \delta)}{\partial \delta} = -c_1\left\{0 - 0 - 2\sum_{i=1}^n y_i e_{ij} + 0 + 2\mu e_{Tj} + 2\delta e_{Tj}\right\}.$$

Setting both of these derivatives to zero and solving for $\mu$ and $\delta$
simultaneously, we obtain the next two values in the algorithm:
$$\mu_{j+1} = \frac{\bar{y} - \frac{1}{n}\sum_{i=1}^n y_i e_{ij}}{1 - \frac{1}{n}e_{Tj}}, \qquad \delta_{j+1} = \frac{\sum_{i=1}^n y_i e_{ij}}{e_{Tj}} - \mu_{j+1}.$$

The EM algorithm is now completely defined.

Starting the algorithm from $(\mu_0, \delta_0) = (10, 1)$, we obtain the sequence
shown in Table 4.4. We see that the algorithm has converged to what we
believe to be the posterior mode, $(\hat{\mu}, \hat{\delta}) = (20.08, 9.72)$.


Running the algorithm from different starting points we obtain the same 
final results. Unlike the NR algorithm, we find that the EM algorithm 
always converges, regardless of the point from which it is started. 


Table 4.4 Results of an EM algorithm

$j$     $\mu_j$    $\delta_j$
0       10.000     1.000
1       21.169     3.032
2       20.321     7.070
3       19.843     9.139
4       19.926     9.518
5       20.005     9.626
6       20.046     9.674
7       20.066     9.697
8       20.075     9.708
9       20.080     9.713
10      20.082     9.715
11      20.083     9.717
12      20.084     9.717
13      20.084     9.717
14      20.084     9.718
15      20.084     9.718
16      20.084     9.718
17      20.084     9.718
18      20.084     9.718
19      20.084     9.718
20      20.084     9.718
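
The E-Step and M-Step formulae derived in part (b) can be sketched in R as below. This is an illustrative sketch and not the book's own code: the seed, the variable names (`pi0` for $\pi$, `em_step`, `theta`) and the choice of 200 iterations are assumptions, and because the data are simulated afresh the numbers will differ from those in Table 4.4.

```r
set.seed(1)                                   # arbitrary seed, for reproducibility only
n <- 100; pi0 <- 1/3; sig <- 3                # known quantities (pi0 plays the role of pi)
R <- rbinom(n, 1, pi0)                        # latent indicators, used only to simulate y
y <- rnorm(n, mean = 20 + 10 * R, sd = sig)   # true mu = 20 and delta = 10

em_step <- function(theta) {
  mu <- theta[1]; del <- theta[2]
  # E-Step: e_i = P(R_i = 1 | y, mu, delta)
  w1 <- pi0 * exp(-(y - mu - del)^2 / (2 * sig^2))
  w0 <- (1 - pi0) * exp(-(y - mu)^2 / (2 * sig^2))
  e  <- w1 / (w1 + w0)
  eT <- sum(e)
  # M-Step: simultaneous solution of the two derivative equations
  mu_new  <- (mean(y) - sum(y * e) / n) / (1 - eT / n)
  del_new <- sum(y * e) / eT - mu_new
  c(mu_new, del_new)
}

theta <- c(10, 1)                             # same starting point as in Table 4.4
for (j in 1:200) theta <- em_step(theta)
theta
```

Writing one EM iteration as a function of the current parameter vector makes it easy to record the route taken by the algorithm, for example by storing `theta` at each pass through the loop.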


(c) The ECM requires us to once again examine the Q-function,
$$Q_j(\mu, \delta) = -c_1\left\{c_2 - 2\mu n\bar{y} - 2\delta\sum_{i=1}^n y_i e_{ij} + n\mu^2 + 2\mu\delta e_{Tj} + \delta^2 e_{Tj}\right\},$$
but now to maximise this function with respect to $\mu$ and $\delta$ individually
(rather than simultaneously as for the EM algorithm in (b)).

Thus, setting
$$\frac{\partial Q_j(\mu, \delta)}{\partial \mu} = -c_1\left\{0 - 2n\bar{y} - 0 + 2n\mu + 2\delta e_{Tj} + 0\right\}$$
to zero we get
$$\mu_{j+1} = \bar{y} - \delta_j\frac{1}{n}e_{Tj}$$
(after substituting in $\delta = \delta_j$).

Then, setting
$$\frac{\partial Q_j(\mu, \delta)}{\partial \delta} = -c_1\left\{0 - 0 - 2\sum_{i=1}^n y_i e_{ij} + 0 + 2\mu e_{Tj} + 2\delta e_{Tj}\right\}$$
to zero we get
$$\delta_{j+1} = \frac{\sum_{i=1}^n y_i e_{ij}}{e_{Tj}} - \mu_{j+1}$$
(same equation as in (b)).

We see that the ECM algorithm here is fairly similar to the EM algorithm.

Starting the algorithm at $(\mu_0, \delta_0) = (10, 1)$ we obtain the sequence shown
in Table 4.5. We see that the ECM algorithm has converged to
the same values as the EM algorithm, but along a slightly different route.


(d) Figure 4.8 shows a contour plot of the log-posterior density
$\log f(\mu, \delta \mid y)$ and the routes of the EM and ECM algorithms in parts
(b) and (c), each from the starting point $(\mu_0, \delta_0) = (10, 1)$ to the mode,
$(\hat{\mu}, \hat{\delta}) = (20.08, 9.72)$. Also shown are two other pairs of routes, one pair
starting from (5, 30), and the other from (35, 20).


Note 1: In this exercise there is little difference between the EM and 
ECM algorithms, both as regards complexity and performance. In more 
complex models we may expect the EM algorithm to converge faster 
but have an M-Step which is more difficult to complete than the set of 
separate Conditional Maximisation-Steps (CM-Steps) of the ECM 
algorithm. 


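Note 2: The log-posterior density in Figure 4.8 has a formula which can
be derived as follows. First, the joint posterior of all unknowns in the
model is

$$f(\mu,\delta,R \mid y) \propto f(\mu,\delta,y,R) \propto \prod_{i=1}^{n} \pi^{R_i}(1-\pi)^{1-R_i} \exp\left\{ -\frac{1}{2\sigma^2}\left(y_i - [\mu + R_i\delta]\right)^2 \right\}.$$

So the joint posterior density of just $\mu$ and $\delta$ is

$$f(\mu,\delta \mid y) = \sum_{R} f(\mu,\delta,R \mid y) \propto \prod_{i=1}^{n} \sum_{R_i=0}^{1} \pi^{R_i}(1-\pi)^{1-R_i} \exp\left\{ -\frac{1}{2\sigma^2}\left(y_i - [\mu + R_i\delta]\right)^2 \right\}$$

$$= \prod_{i=1}^{n} \left[ \pi \exp\left\{ -\frac{1}{2\sigma^2}\left(y_i - [\mu + \delta]\right)^2 \right\} + (1-\pi) \exp\left\{ -\frac{1}{2\sigma^2}\left(y_i - \mu\right)^2 \right\} \right].$$

So the log-posterior density of $\mu$ and $\delta$ is

$$l(\mu,\delta) = \log f(\mu,\delta \mid y) = c + \sum_{i=1}^{n} \log\left( \pi \exp\left\{ -\frac{1}{2\sigma^2}\left(y_i - [\mu + \delta]\right)^2 \right\} + (1-\pi) \exp\left\{ -\frac{1}{2\sigma^2}\left(y_i - \mu\right)^2 \right\} \right),$$

where c is an additive constant and can arbitrarily be set to zero.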


Note 3: As an additional exercise (and a check on our calculations
above), we could apply the Newton-Raphson algorithm so as to find the
mode of $l(\mu,\delta)$. But this would require us to first determine formulae
for the following rather complicated partial derivatives:

$$\frac{\partial l(\mu,\delta)}{\partial \mu}, \quad \frac{\partial l(\mu,\delta)}{\partial \delta}, \quad \frac{\partial^2 l(\mu,\delta)}{\partial \mu^2}, \quad \frac{\partial^2 l(\mu,\delta)}{\partial \delta^2}, \quad \frac{\partial^2 l(\mu,\delta)}{\partial \delta \, \partial \mu},$$

and could prove to be unstable. That is, the algorithm might fail to
converge if started from a point not very near the required solution.

Another option is to apply the CNR algorithm (the conditional Newton-Raphson
algorithm). This would obviate the need for one of the
derivatives above, $\partial^2 l(\mu,\delta) / \partial \delta \, \partial \mu$, and might be more stable, albeit at the cost
of not converging so quickly as the plain NR algorithm.

As yet another possibility, we could apply the CNR1 algorithm. This is
the same as the CNR algorithm, except that at each conditional step we
perform just one iteration of the univariate NR algorithm before moving
on to the other of the two conditional steps.

Finally, we could use the R function optim() to maximise $l(\mu,\delta)$.
Although this function will be formally introduced later, we can report
that it does indeed find the posterior mode, $(\hat{\mu}, \hat{\delta}) = (20.08, 9.72)$. For
details, see the bottom of the R code below.


Table 4.5 Results of an ECM algorithm

j   $\mu_j$   $\delta_j$

0 10.000 1.000
1 22.505 1.696
2 22.566 3.882
3 21.905 6.811
4 21.139 8.729
5 20.611 9.501
6 20.322 9.732
7 20.181 9.774
8 20.118 9.764
9 20.093 9.746
10 20.085 9.732
11 20.083 9.725
12 20.083 9.720
13 20.083 9.719
14 20.084 9.718
15 20.084 9.718
16 20.084 9.718
17 20.084 9.718
18 20.084 9.718
19 20.084 9.718
20 20.084 9.718


Figure 4.8 Routes of the EM and ECM algorithms

[Contour plot; horizontal axis: mu, vertical axis: delta]


R Code for Exercise 4.7 


# (a)
X11(w=8,h=4.5); par(mfrow=c(1,1)); options(digits=4)
ntrue=100; pitrue=1/3; mutrue=20; deltrue=10; sigtrue=3

set.seed(512); Rvec=rbinom(ntrue,1,pitrue); sum(Rvec) # 29
yvec=rnorm(ntrue,mutrue+deltrue*Rvec,sigtrue)
ybar=mean(yvec); ybar # 23.16

hist(yvec,prob=T, breaks=seq(0,50,0.5),xlim=c(10,40), ylim=c(0,0.2),
  xlab="y", main=" ")
yv=seq(0,50,0.01); lines(yv,dnorm(yv,mutrue,sigtrue),lty=2,lwd=2)
lines(yv,dnorm(yv,mutrue+deltrue, sigtrue),lty=2,lwd=2)
lines(yv, (1-pitrue)*dnorm(yv,mutrue,sigtrue)+
  pitrue*dnorm(yv,mutrue+deltrue,sigtrue), lty=1,lwd=2)
legend(10,0.2,c("Components","Mixture"),lty=c(2,1),lwd=c(2,2))


# (b)
evalsfun= function(y=yvec, pii=pitrue, mu=mutrue,del=deltrue,sig=sigtrue){
# This function outputs (e1,e2,...,en)
term1vals=pii*dnorm(y,mu+del,sig)
term0vals=(1-pii)*dnorm(y,mu,sig)
term1vals/(term1vals+term0vals) }

EMfun=function(J=20, mu=10, del=1, y=yvec, pii=pitrue, sig=sigtrue){
muv=mu; delv=del; ybar=mean(y); n=length(y)
for(j in 1:J){
evals=evalsfun(y=y, pii=pii, mu=mu, del=del, sig=sig)
sumyevals = sum(y*evals); sumevals=sum(evals)
mu=(ybar-sumyevals/n) / (1-sumevals/n)
del=sumyevals/sumevals - mu
muv=c(muv,mu); delv=c(delv,del)
}
list(muv=muv,delv=delv)
}
EMres=EMfun(J=20, mu=10, del=1,y=yvec,pii=pitrue,sig=sigtrue)
outmat = cbind(0:20,EMres$muv, EMres$delv)
print.matrix <- function(m){ write.table(format(m, justify="right"),
  row.names=F, col.names=F, quote=F) }
print.matrix(outmat)
# 0.000 10.000 1.000
# 1.000 21.169 3.032
# 2.000 20.321 7.070
# 3.000 19.843 9.139
# 4.000 19.926 9.518
# 5.000 20.005 9.626
# ...
# 16.000 20.084 9.718
# 17.000 20.084 9.718
# 18.000 20.084 9.718
# 19.000 20.084 9.718
# 20.000 20.084 9.718

muhat=EMres$muv[21]; delhat=EMres$delv[21];
c(muhat,delhat) # 20.084 9.718

# (c)
CEMfun=function(J=20, mu=10, del=1, y=yvec, pii=pitrue, sig=sigtrue){
muv=mu; delv=del; ybar=mean(y); n=length(y)
for(j in 1:J){
evals=evalsfun(y=y, pii=pii, mu=mu, del=del, sig=sig)
sumyevals = sum(y*evals); sumevals=sum(evals)
mu=ybar-del*sumevals/n
del=sumyevals/sumevals - mu
muv=c(muv,mu); delv=c(delv, del)
}
list(muv=muv,delv=delv)
}
CEMres=CEMfun(J=20, mu=10, del=1,y=yvec,pii=pitrue,sig=sigtrue)
outmat2 = cbind(0:20, CEMres$muv, CEMres$delv)
print.matrix(outmat2)
# 0.000 10.000 1.000
# 1.000 22.505 1.696
# 2.000 22.566 3.882
# 3.000 21.905 6.811
# 4.000 21.139 8.729
# 5.000 20.611 9.501
# ...
# 16.000 20.084 9.718
# 17.000 20.084 9.718
# 18.000 20.084 9.718
# 19.000 20.084 9.718
# 20.000 20.084 9.718


# (d)
X11(w=8,h=9); par(mfrow=c(1,1))
logpostfun=function(mu=10,del=10,y=yvec, pii=pitrue,sig=sigtrue){
  sum(log(pii*dnorm(y,mu+del,sig)+(1-pii)*dnorm(y,mu,sig))) }
mugrid=seq(0,35,0.5); delgrid=seq(0,30,0.5)
logpostmat=as.matrix(mugrid %*% t(delgrid))
dim(logpostmat) # 71 61 (one row per mu value, one column per delta value)

for(i in 1:length(mugrid)) for(j in 1:length(delgrid)) logpostmat[i,j] =
  logpostfun(mu=mugrid[i],del=delgrid[j],y=yvec, pii=pitrue,sig=sigtrue)

contour(x=mugrid, y=delgrid, z=logpostmat, nlevels=20,
  xlab="mu", ylab="delta"); points(muhat,delhat, pch=16,cex=1.2)

points(10,1,pch=16,cex=1.2)
EMres=EMfun(J=20, mu=10, del=1,y=yvec, pii=pitrue,sig=sigtrue)
CEMres=CEMfun(J=20, mu=10, del=1,y=yvec,pii=pitrue,sig=sigtrue)
lines(EMres$muv, EMres$delv,lty=1,lwd=3)
lines(CEMres$muv, CEMres$delv,lty=2,lwd=3)

points(5,30,pch=16,cex=1.2)
EMres=EMfun(J=50, mu=5, del=30,y=yvec, pii=pitrue, sig=sigtrue)
CEMres=CEMfun(J=50, mu=5, del=30, y=yvec, pii=pitrue, sig=sigtrue)
lines(EMres$muv, EMres$delv,lty=1,lwd=3)
lines(CEMres$muv, CEMres$delv,lty=2,lwd=3)

points(35,20,pch=16,cex=1.2)
EMres=EMfun(J=50, mu=35, del=20,y=yvec, pii=pitrue, sig=sigtrue)
CEMres=CEMfun(J=50, mu=35, del=20, y=yvec, pii=pitrue,sig=sigtrue)
lines(EMres$muv, EMres$delv,lty=1,lwd=3)
lines(CEMres$muv, CEMres$delv, lty=2,lwd=3)
legend(21,30,c("EM","ECM"),lty=c(1,2),lwd=c(3,3))


# Note 2. Maximisation of the logposterior density of mu and delta using optim()
logpostfun2=function(theta=c(10,1),y=yvec, pii=pitrue, sig=sigtrue){
  -sum(log(pii*dnorm(y,theta[1]+theta[2],sig)+
  (1-pii)*dnorm(y,theta[1],sig)))
}
res=optim(par=c(10,1),fn= logpostfun2)$par; res # 20.08 9.72
res=optim(par=c(5,30),fn= logpostfun2)$par; res # 20.085 9.716
res=optim(par=c(35,20),fn= logpostfun2)$par; res # 20.084 9.716
res=optim(par=res,fn= logpostfun2)$par; res # 20.084 9.718
# Here we fine-tune the answer by starting at the previous solution.


4.6 Integration techniques 


Bayesian inference typically involves a great deal of integration (and/or
summation). For example, consider the posterior density

$$f(\theta \mid y) = 6\theta^5, \quad 0 < \theta < 1$$

(which featured in a previous exercise involving the binomial-beta model)
and suppose that we wish to find the posterior mean estimate of $\lambda = \theta^2$.
This estimate is

$$\hat{\lambda} = E(\theta^2 \mid y) = \int_0^1 \theta^2 \times (6\theta^5) \, d\theta = 0.75.$$

But what if this integral did not have a simple analytical solution?

In that case, we could consider a number of other strategies. First, we
might re-express the posterior mean as

$$\hat{\lambda} = \int \lambda f(\lambda \mid y) \, d\lambda,$$

where, using the method of transformation,

$$f(\lambda \mid y) = f(\theta \mid y) \left| \frac{d\theta}{d\lambda} \right| = 6(\lambda^{1/2})^5 \times \frac{1}{2}\lambda^{-1/2} = 3\lambda^2, \quad 0 < \lambda < 1,$$

so that

$$\hat{\lambda} = \int_0^1 \lambda (3\lambda^2) \, d\lambda = 0.75.$$


If this strategy does not help, we may then consider using a numerical 
integration technique. 


For example, we could apply the integrate() function in R to get $\hat{\lambda} = 0.75$,
as follows:
gfun = function(t){ 6*t^7 } # Define the function to be integrated
integrate(f=gfun, lower=0,upper=1)$value # 0.75


In some cases the function requiring integration is very complicated or 
does not have a closed form expression. In that case, direct application of 
the integrate() function may not work or be practicable, and then it may 
be useful to apply the trapezoidal rule or Simpson’s rule to evaluate the 
integral. 
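
As a minimal sketch of the two rules just mentioned, the following R code applies the composite trapezoidal rule and the composite Simpson's rule to the integrand 6t^7 from the running example. The helper names trap_rule() and simpson_rule() and the choice of n = 100 subintervals are illustrative assumptions; they are not part of the code accompanying this section.

```r
# Integrand from the running example: g(t) = t^2 * 6*t^5 = 6*t^7 on (0,1)
gfun <- function(t){ 6*t^7 }

# Composite trapezoidal rule with n equal subintervals (hypothetical helper)
trap_rule <- function(f, a, b, n = 100){
  x <- seq(a, b, length.out = n + 1); h <- (b - a)/n
  h * (sum(f(x)) - (f(a) + f(b))/2)
}

# Composite Simpson's rule; n must be even (hypothetical helper)
simpson_rule <- function(f, a, b, n = 100){
  stopifnot(n %% 2 == 0)
  x <- seq(a, b, length.out = n + 1); h <- (b - a)/n
  w <- c(1, rep(c(4, 2), length.out = n - 1), 1)   # weights 1,4,2,4,...,4,1
  h/3 * sum(w * f(x))
}

trap_est <- trap_rule(gfun, 0, 1)
simp_est <- simpson_rule(gfun, 0, 1)
c(trap_est, simp_est)   # both should be close to the exact value 0.75
```

For a smooth integrand such as this one, Simpson's rule would be expected to be the more accurate of the two for the same number of function evaluations.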


When working in R, the following is often a convenient strategy:
(i) evaluate $g(\theta) = \theta^2 \times 6\theta^5$ at each $\theta$ on the grid
0, 0.1, 0.2, ..., 0.9, 1 (say); then
(ii) create a spline through these points, using the smooth.spline() and
predict() functions; and then
(iii) find the area under this spline using the integrate() function.


Applying this method (see the R code below for details) yields 0.7558 as
an estimate of $\lambda$. Repeating, but with the evaluations on the grid 0, 0.01,
0.02, ..., 1 yields 0.7500. Repeating again, but with evaluations on the grid
0, 0.001, 0.002, ..., 1 yields 0.7500. It appears that a limit has been reached
and that using a finer grid would not result in any improvements to the
results of this numerical procedure.

We may conclude that $\hat{\lambda} = 0.7500$ (to 4 decimals).


R Code for Section 4.6


gfun = function(t){ 6*t^7 } # Define the function to be integrated
integrate(f=gfun,lower=0,upper=1)$value # 0.75

INTEG <- function(xvec, yvec, a = min(xvec), b = max(xvec)){
# Integrates numerically under a spline through the
# points given by the vectors xvec and yvec, from a to b.
fit <- smooth.spline(xvec, yvec)
spline.f <- function(x){predict(fit, x)$y }
integrate(spline.f, a, b)$value }

gfun=function(t){ 6*t^7 }
tvec <- seq(0,1,0.1); gvec <- gfun(tvec)
INTEG(tvec,gvec,0,1) # 0.755803

tvec <- seq(0,1,0.01); gvec <- gfun(tvec)
INTEG(tvec,gvec,0,1) # 0.75

tvec <- seq(0,1,0.001); gvec <- gfun(tvec)
INTEG(tvec,gvec,0,1) # 0.75


Exercise 4.8 Numerical integration 


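Suppose that $X \sim N(\mu,\sigma^2)$ and $Y = (X \mid X > c)$ where $\mu = 8$, $\sigma = 3$
and $c = 10$. Find $EY$ using numerical techniques and compare your answer
with the exact value,

$$EY = \mu + \sigma \, \frac{\phi\left(\frac{c-\mu}{\sigma}\right)}{1 - \Phi\left(\frac{c-\mu}{\sigma}\right)},$$

which was derived analytically in Exercise 4.6.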


Solution to Exercise 4.8 


The required integral is

$$EY = \int_c^{\infty} g(x) \, dx,$$

where:
$$g(x) = \frac{x f(x)}{P(X > c)}, \quad f(x) = \frac{1}{\sigma}\phi\left(\frac{x-\mu}{\sigma}\right), \quad P(X > c) = 1 - \Phi\left(\frac{c-\mu}{\sigma}\right).$$

Applying the integrate() function directly to g(x) we get EY = 11.7955.

Applying the INTEG() function (defined in Section 4.6) with coordinates
given by (10, 10.1, 10.2, ..., 30) and (g(10), g(10.1), g(10.2), ..., g(30)), we
also get EY = 11.7955. The exact value of EY is in fact

$$8 + 3 \, \frac{\phi\left(\frac{10-8}{3}\right)}{1 - \Phi\left(\frac{10-8}{3}\right)} = 11.7955.$$


Note: If we use the integrate() function with bounds from 10 to 20 rather 
than 10 to 30, we get 11.7929, which is slightly in error. Exactly the 
same happens with the INTEG() function. Thus, when using either of 
these functions, care must be taken to choose a large enough range. 
Ideally, we will sketch the integrand function and make sure the range 
of integration is sufficiently broad to cover all important regions (where 
the integrand is significantly positive). In practice, it is useful to 
gradually increase the range of integration until the answer stops 
changing. Likewise, it is useful to gradually increase the grid density 
chosen for the INTEG() function until the answer stops changing. 
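
The advice in this note could be automated along the following lines. This is only a sketch under stated assumptions: the starting upper bound of 15, the step of 5, the tolerance of 1e-4 and the cap of 100 are arbitrary illustrative choices, and the variable names (cc, upper, step, tol) are hypothetical rather than taken from the code for this exercise.

```r
# Truncated-normal mean example: X ~ N(8, 3^2), Y = (X | X > 10)
mu <- 8; sig <- 3; cc <- 10
PXpos <- 1 - pnorm((cc - mu)/sig)
gfun <- function(x){ x * dnorm(x, mu, sig) / PXpos }

# Widen the range of integration until the answer stops changing
upper <- 15; step <- 5; tol <- 1e-4   # assumed settings
est <- integrate(gfun, cc, upper)$value
repeat{
  upper <- upper + step
  new_est <- integrate(gfun, cc, upper)$value
  converged <- abs(new_est - est) < tol
  est <- new_est
  if(converged || upper >= 100) break   # stop at a cap to avoid looping forever
}

# Exact value, for comparison
true_val <- mu + sig*dnorm((cc - mu)/sig)/PXpos
c(upper, est, true_val)
```

The same loop could be wrapped around INTEG(), with an additional outer loop over the grid spacing.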


R Code for Exercise 4.8 


# First declare the function INTEG() as defined in Section 4.6

mu=8; sig=3; c = 10; options(digits=6)
PXpos = (1-pnorm((c-mu)/sig))
gfun=function(x){ x * dnorm(x,mu,sig) / PXpos }
integrate(gfun,c,20)$value # 11.7929
integrate(gfun,c,30)$value # 11.7955
xvec <- seq(c,20,0.1); gvec <- gfun(xvec); INTEG(xvec,gvec,c,20) # 11.7929
xvec <- seq(c,30,0.1); gvec <- gfun(xvec); INTEG(xvec,gvec,c,30) # 11.7955
true=mu + sig*dnorm((c-mu)/sig)/(1-pnorm((c-mu)/sig)); true # 11.7955


Exercise 4.9 Double integration 


Use the integrate() and INTEG() functions in at least two different ways 
so as to calculate the double integral 


$$I = \int_{x=0}^{1} \left( \int_{t=0}^{x^3} t^t \, dt \right) dx.$$

Illustrate your calculations with suitable graphs of the relevant functions
involved.


Solution to Exercise 4.9


Using the integrate() function alone (and not the INTEG() function), the
integral can be worked out as follows:

integrate(function(x) {
  sapply(x, function(x) {
    integrate(function(t) {
      sapply(t, function(t) t^t )
    }, 0, x^3)$value }) }, 0, 1)

# 0.192723 with absolute error < 7.8e-10


Another approach is as follows. Observe that

$$I = \int_0^1 g(x) \, dx,$$

where

$$g(x) = \int_0^{x^3} h(t) \, dt$$

and

$$h(t) = t^t.$$


We will now use the integrate() function to obtain g(x) for each value of 


x in the grid 0, 0.01, 0.02, ..., 1. We will then apply the INTEG() function 
to the resulting coordinates. 


Figure 4.9 below displays the two functions h(t) and g(x). The value
g(0.8) = 0.381116 is the area under h(t) between 0 and 0.8^3 = 0.512. The
total area under h(t) (from 0 to 1) is 0.78343.


The total area under g(x) (from 0 to 1) is estimated as 0.192723. Using 


the grid 0, 0.001, 0.002, ..., 1 also leads to 0.192723, whereas using the 
grid 0, 0.1, 0.2, ..., 1 leads to 0.193054. 


We conclude that the exact value of the required integral I to 4 decimals
is 0.1927, which is in agreement with the first approach above which
doesn't make use of the INTEG() function.

One could also adapt the second approach above so as to calculate the
double integral using the INTEG() function only (without using the
integrate() function directly). This might be useful if the inner integral

$$g(x) = \int_{t=0}^{x^3} h(t) \, dt, \quad \text{where } h(t) = t^t,$$

could not be evaluated easily using integrate() directly, for example if
h(t) were a very complicated function which could not be expressed in
closed form.


Note: The integrate() function is called within the INTEG() function and 
so is used at least indirectly in all of the approaches considered here. 


Figure 4.9 Two functions


R Code for Exercise 4.9


integrate(function(x) {
  sapply(x, function(x) {
    integrate(function(t) {
      sapply(t, function(t) t^t )
    }, 0, x^3)$value }) }, 0, 1)
# 0.192723 with absolute error < 7.8e-10

# Declare the function INTEG() as defined in Section 4.6
options(digits=6); X11(w=8,h=6); par(mfrow=c(2,1))

hfun= function(t){ t^t }
tvec=seq(0,1,0.01); hvec=hfun(tvec)
# The inner integral for x = 0.8 runs from 0 to 0.8^3, so mark that point
plot(tvec,hvec,type="l",xlab="t",ylab="h(t)",lwd=2); abline(v=0.8^3,lty=2)

integrate(f=hfun,lower=0,upper=0.8^3)$value
# 0.381116 This is g(0.8) = area under h(t) to left of 0.8^3
integrate(f=hfun,lower=0,upper=1)$value
# 0.78343 This is the total area under h(t) (from 0 to 1)

xvec = seq(0,1,0.01); gvec = rep(NA,length(xvec))
for(i in 1:length(xvec)){ xval = xvec[i]
  gvec[i] = integrate(f=hfun, lower=0,upper=xval^3)$value }
INTEG(xvec,gvec) # 0.192723
plot(xvec,gvec,type="l",xlab="x",ylab="g(x)",lwd=2)
points(0.8, 0.381116 , pch=16, cex=1)

# Apply INTEG() using different grids

xvec = seq(0,1,0.001); gvec = rep(NA,length(xvec))
for(i in 1:length(xvec)){ xval = xvec[i]
  gvec[i] = integrate(f=hfun, lower=0,upper=xval^3)$value }
INTEG(xvec,gvec) # 0.192723

xvec = seq(0,1,0.1); gvec = rep(NA,length(xvec))
for(i in 1:length(xvec)){ xval = xvec[i]
  gvec[i] = integrate(f=hfun, lower=0,upper=xval^3)$value }
INTEG(xvec,gvec) # 0.193053


4.7 The optim() function 


The function optim() in R is a very useful and versatile tool for 
maximising or minimising functions, both of one and of several variables. 


This R function can also be adapted for solving single or simultaneous 
equations and provides an alternative to other techniques such as trial and 


error, the Newton-Raphson algorithm and the EM algorithm. 


The second of the next two exercises shows how the optim() function can
be used to specify a prior distribution.


Exercise 4.10 Simple examples of the optim() function


Use the optim() function to ‘find’ the mode of each of the following:

(a) $g(x) = x^2 e^{-5x}, \quad x > 0$ (mode = 2/5)

(b) $g(x) = \dfrac{e^{-(x-1)^2} \, |x|^x}{1 + |x|}, \quad x \in \Re$ (the mode has no closed form)

(c) $g(x, y) = y^3 e^{-y\{(x-1)^2 + (x-3)^2\}}, \quad x \in \Re, \ y > 0$
(mode = (x, y) = ((1 + 3)/2, 3/2)).


Solution to Exercise 4.10 


In each of these cases, the optim() function (which minimises a function 
by default) may be applied to the negative of the specified function (so as 
to maximise that function). 
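
As a small sketch of two alternatives to negating the function by hand (for a function of one variable), base R offers optimize() with maximum = TRUE, and optim() accepts control = list(fnscale = -1), which makes it maximise. The search interval (0, 5) below is an assumption made for illustration; the function is the one in part (a) of the exercise.

```r
g <- function(x){ x^2 * exp(-5*x) }   # function from part (a); mode at 2/5

# Option 1: optimize() searches an interval and can maximise directly
mode1 <- optimize(g, interval = c(0, 5), maximum = TRUE)$maximum

# Option 2: optim() with method "Brent" (which needs finite bounds);
# fnscale = -1 tells optim() to maximise rather than minimise
mode2 <- optim(par = 0.5, fn = g, method = "Brent", lower = 0, upper = 5,
               control = list(fnscale = -1))$par

c(mode1, mode2)   # both should be close to 2/5
```

Either option avoids the Nelder-Mead warning that optim() issues for one-dimensional problems under its default method.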


(a) The function correctly returns x = 2/5. (NB: The warning message
may be ignored.)


(b) The function returns a value of 1.5047. (We presume that this is
correct; see below for a verification.)


(c) The mode is correctly computed as (x, y) = (2, 1.5). (Note that this
solution is obvious by analogy with maximum likelihood estimation of
the normal mean and variance.)


Figure 4.10 illustrates these three solutions, with each mode being marked
by a dot and vertical line. Subplot (c) shows several examples of the
function g(x, y) in part (c) considered as a function of only x, with each
line defined by a fixed value of y on the grid 0, 0.5, 1, ..., 4.5, 5.


Figure 4.10 Maximisation of function g in parts (a), (b) and (c)


R Code for Exercise 4.10

help(optim); options(digits=5); X11(w=8,h=8); par(mfrow=c(3,1))

# (a)
fun=function(x){ -x^2 * exp(-5*x) }
res0=optim(par=0.5,fn=fun)$par; res0 # 0.4
# Warning message:
# In optim(par = 0.5, fn = fun) :
# one-diml optimization by Nelder-Mead is unreliable:
# use "Brent" or optimize() directly
plot(seq(0,5,0.01), -fun(seq(0,5,0.01)),type="l",lwd=3,xlab="x",ylab="g(x)");
abline(v=res0); points(res0, -fun(res0), pch=16, cex=2); text(4,0.02,"(a)",cex=2)

# (b)
fun=function(x){ -exp(-(x-1)^2) * abs(x)^x/(1+abs(x)) }
res0=optim(par=1,fn=fun)$par; res0 # 1.5047
plot(seq(-2,5,0.01), -fun(seq(-2,5,0.01)),type="l",lwd=3, xlab="x",ylab="g(x)");
abline(v=res0); points(res0, -fun(res0), pch=16, cex=2); text(4,0.45,"(b)",cex=2)

# (c)
fun=function(v){ -v[2]^3 * exp( -v[2] * ((v[1]-1)^2 + (v[1]-3)^2) ) }
res0=optim(par=c(2,2),fn=fun, lower = c(-Inf,0), upper = c(Inf,Inf),
  method = "L-BFGS-B")$par; res0 # 2.0 1.5

fun2=function(x,y){ y^3 * exp( -y * ( (x-1)^2 + (x-3)^2) ) }

plot(c(0.5,3.5),c(0,0.2), type="n",xlab="x",ylab="f(x,y)")
for(y in seq(0,5,0.5))
  lines(seq(0,5,0.01), fun2(x=seq(0,5,0.01),y=y), lty=1)
abline(v=res0[1]); points(res0[1],fun2(res0[1],res0[2]), pch=16, cex=2);
lines(seq(0,5,0.01),fun2(x= seq(0,5,0.01), y=res0[2]),lty=1,lwd=3);
text(3,0.17,"(c)",cex=2)


Exercise 4.11 Specification of parameters in a prior
distribution using the optim() function


Consider the normal-gamma model given by:
$(y_1, \ldots, y_n \mid \lambda) \sim iid \ N(\mu, 1/\lambda)$
$\lambda \sim G(\eta, \tau)$.


Use the optim() function in R to find the values of $\eta$ and $\tau$ which
correspond to a prior belief that the population standard deviation
$\sigma = 1/\sqrt{\lambda}$ lies between 0.5 and 1 with 95% probability, and that $\sigma$ is
equally likely to be below 0.5 as it is to be above 1.


Solution to Exercise 4.11 


We wish to find the values of $\eta$ and $\tau$ which satisfy the two equations:
$P(\sigma < a) = \alpha/2$ and $P(\sigma < b) = 1 - \alpha/2$,
where a = 0.5, b = 1 and $\alpha$ = 0.05.


These two equations are together equivalent to each of the following five
pairs of equations:

$P(\sigma^2 < a^2) = \alpha/2$ and $P(\sigma^2 < b^2) = 1 - \alpha/2$
$P(1/\lambda < a^2) = \alpha/2$ and $P(1/\lambda < b^2) = 1 - \alpha/2$
$P(1/a^2 < \lambda) = \alpha/2$ and $P(1/b^2 < \lambda) = 1 - \alpha/2$
$P(\lambda < 1/a^2) = 1 - \alpha/2$ and $P(\lambda < 1/b^2) = \alpha/2$
$F_{G(\eta,\tau)}(1/a^2) - (1 - \alpha/2) = 0$ and $F_{G(\eta,\tau)}(1/b^2) - \alpha/2 = 0$,

where $F_{G(\eta,\tau)}$ denotes the cumulative distribution function of the
Gamma($\eta$, $\tau$) distribution.

We now focus on the last of these pairs of two equations. Two obvious
ways to solve these equations are by trial and error and via the multivariate
Newton-Raphson algorithm, as illustrated earlier. But the solution can be
obtained more easily by using the optim() function to minimise

$$g(\eta,\tau) = \left[ F_{G(\eta,\tau)}(1/a^2) - (1 - \alpha/2) \right]^2 + \left[ F_{G(\eta,\tau)}(1/b^2) - \alpha/2 \right]^2.$$

Note: Clearly, this function has a value of zero at the required values of
$\eta$ and $\tau$.


With the default settings and starting at $\eta$ = 0.2 and $\tau$ = 6, optim()
produced some warning messages (which we ignored) and provided the
solution, $\eta$ = 8.4764 and $\tau$ = 3.7679.


Now, this solution is not exactly correct, because the probabilities of a
Gamma(8.4764, 3.7679) random variable lying below $1/b^2 = 1$ and
below $1/a^2 = 4$, respectively, are 0.025048 and 0.975104 (i.e. not exactly
0.025 and 0.975 as desired).


However, applying the optim() function again but starting at the previous
solution, namely $\eta$ = 8.4764 and $\tau$ = 3.7679, yielded a ‘refined’
solution, $\eta$ = 8.4748 and $\tau$ = 3.7654.


This solution may be considered correct, because the probabilities of a
Gamma(8.4748, 3.7654) random variable being less than $1/b^2 = 1$ and
less than $1/a^2 = 4$, respectively, are 0.025 and 0.975 to the precision reported.
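
One possible alternative to restarting optim() at its previous solution is to tighten its stopping rule through the control argument. The sketch below assumes the same objective function as in the solution above, a starting point of (6, 3), and illustrative settings reltol = 1e-14 and maxit = 2000; whether a single call suffices will in general depend on the starting point. Warnings from pgamma() at invalid trial values are suppressed here.

```r
a <- 0.5; b <- 1; alp <- 0.05

# Sum of squared discrepancies between achieved and target probabilities
fun <- function(v){
  (pgamma(1/a^2, v[1], v[2]) - (1 - alp/2))^2 +
  (pgamma(1/b^2, v[1], v[2]) - alp/2)^2 }

# Tighter tolerance and larger iteration cap than the defaults (assumed values)
fit <- suppressWarnings(
  optim(par = c(6, 3), fn = fun, control = list(reltol = 1e-14, maxit = 2000)))

# Check the achieved probabilities against the targets 0.025 and 0.975
probs <- pgamma(c(1/b^2, 1/a^2), fit$par[1], fit$par[2])
fit$par; probs
```

It remains good practice to check fit$convergence (0 indicates that the stopping rule was met) and to verify the achieved probabilities, as is done in the last line.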


Discussion 


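It is instructive to derive and plot the corresponding density of the
precision parameter $\lambda$, and then to do this also for the variance parameter
$\sigma^2 = \lambda^{-1}$ and the standard deviation parameter $\sigma = \lambda^{-1/2}$, respectively.


The three densities are plotted in Figure 4.11 (in the stated order from top
to bottom). The vertical lines show the 0.025 and 0.975 quantiles of each
distribution. The formulae for the three densities are as follows:

$$f(\lambda) = f_{G(\eta,\tau)}(\lambda) = \frac{\tau^{\eta} \lambda^{\eta-1} e^{-\tau\lambda}}{\Gamma(\eta)}, \quad \lambda > 0$$

$$f(\sigma^2) = f_{IG(\eta,\tau)}(\sigma^2) = f_{G(\eta,\tau)}(\lambda) \left| \frac{d\lambda}{d\sigma^2} \right| \quad \text{where } \lambda = (\sigma^2)^{-1}$$

$$= \frac{\tau^{\eta} (1/\sigma^2)^{\eta-1} e^{-\tau(1/\sigma^2)}}{\Gamma(\eta)} \times (\sigma^2)^{-2}, \quad \sigma^2 > 0$$

$$f(\sigma) = f_{G(\eta,\tau)}(\lambda) \left| \frac{d\lambda}{d\sigma} \right| \quad \text{where } \lambda = \sigma^{-2}$$

$$= \frac{\tau^{\eta} (1/\sigma^2)^{\eta-1} e^{-\tau(1/\sigma^2)}}{\Gamma(\eta)} \times 2\sigma^{-3}, \quad \sigma > 0.$$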


As a check on the last of these three densities, the integrate() function was 
used to show that the area under that density is exactly 1, and that the areas 
underneath it to the left of 0.5 and to the right of 1 are both exactly 0.025. 


Figure 4.11 Three prior densities


R Code for Exercise 4.11


options(digits=5); a=0.5; b=1; alp=0.05;
fun=function(v,alp=0.05,a=0.5,b=1){
  (pgamma(1/a^2,v[1],v[2])-(1-alp/2))^2 +
  (pgamma(1/b^2,v[1],v[2])-(alp/2))^2 }

res0=optim(par=c(0.2,6),fn=fun)$par
res0 # 8.4764 3.7679
pgamma(c(1/b^2,1/a^2),res0[1],res0[2]) # 0.025048 0.975104 Close

res=optim(par=res0,fn=fun)$par; res # 8.4748 3.7654
pgamma(c(1/b^2,1/a^2),res[1],res[2]) # 0.025 0.975 Correct

res2=optim(par=c(6,3),fn=fun)$par; res2 # 8.4753 3.7655
pgamma(c(1/b^2,1/a^2),res2[1],res2[2]) # 0.024992 0.974996 Close

res3=optim(par=res2,fn=fun)$par; res3 # 8.4748 3.7654
pgamma(c(1/b^2,1/a^2),res3[1],res3[2]) # 0.025 0.975 Correct

par(mfrow=c(3,1)); tv=seq(0,10,0.01)

plot(tv, dgamma(tv,res[1],res[2]),type="l",lwd=2, xlim=c(0,6),
  xlab="lambda",ylab="density"); abline(v=c(1/a^2,1/b^2));
abline(h=0,lty=3)

plot(tv,dgamma(1/tv,res[1],res[2])/tv^2, type="l", lwd=2, xlim=c(0,1.5),
  xlab="sigma^2", ylab="density");
abline(v=c(a^2,b^2)); abline(h=0,lty=3)

plot(tv,dgamma(1/tv^2,res[1],res[2])*2/tv^3, type="l", lwd=2,
  xlim=c(0.35,1.4), xlab="sigma",ylab="density");
abline(v=c(a,b)); abline(h=0,lty=3)

# Check areas under the last curve
func=function(t){ dgamma(1/t^2,res[1],res[2])*2/t^3 }
integrate(func,lower=0,upper=Inf)$value # 1 Correct
integrate(func,lower=0,upper=0.5)$value # 0.025 Correct
integrate(func,lower=1,upper=Inf)$value # 0.025 Correct




Monte Carlo Basics 


5.1 Introduction 


The term Monte Carlo (MC) methods refers to a broad collection of tools
that are useful for approximating quantities based on artificially generated
random samples. These include Monte Carlo integration (for
estimating an integral using such a sample), the inversion technique (for
generating the required sample), and Markov chain Monte Carlo methods
(an advanced topic in Chapter 6). In principle, the approximation can be
made as good as required simply by making the Monte Carlo sample size
sufficiently large. As will be seen later, Monte Carlo methods
are a very useful tool in Bayesian inference.


To illustrate the basic idea of Monte Carlo methods, consider Buffon's 
needle problem, where a needle of length 10 cm (say) is dropped randomly 
onto a floor with parallel lines being distance 10 cm apart. What is p, the 
probability of the needle crossing a line? The exact value of p can be
worked out analytically as $2/\pi = 0.63662$ (this is done in one of the
exercises below). But this takes mathematical effort. If this analytical
solution were not possible (or just too much work), we could instead 
estimate p via Monte Carlo. The simplest way to do this would be to toss 
the needle onto the floor 1,000 times (randomly and independently). If the 
needle crosses a line 641 times (say), then the Monte Carlo estimate of p 
is just 641/1,000 = 0.641. 


As a variation on this physical experiment (which could be rather 
laborious), we could toss the needle 1,000 times virtually, meaning that 
we simulate each drop (or rather the parameters of each drop) on a 
computer and each time determine whether the virtual needle has crossed 
a virtual line. 
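
A minimal sketch of such a virtual experiment is given below. It assumes one common parameterisation of a drop, which is not necessarily the one used in the exercises: the distance d from the needle's centre to the nearest line is uniform on (0, 5), the acute angle theta between the needle and the lines is uniform on (0, pi/2), and the needle crosses a line when d is at most 5*sin(theta). The variable names and the seed are illustrative.

```r
set.seed(1)                        # arbitrary seed, for reproducibility
J <- 100000                        # number of virtual tosses
len <- 10; gap <- 10               # needle length and line spacing (cm)

d <- runif(J, 0, gap/2)            # distance from needle centre to nearest line
theta <- runif(J, 0, pi/2)         # acute angle between needle and the lines
cross <- d <= (len/2)*sin(theta)   # TRUE if the virtual needle crosses a line

phat <- mean(cross)                # Monte Carlo estimate of p
se <- sqrt(phat*(1 - phat)/J)      # its estimated standard error
c(phat, se, 2/pi)                  # compare with the exact value 2/pi
```

With J this large the estimate should typically agree with 2/pi to about two decimal places.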


This method will be faster and more accurate; but it will also require at
least some mathematical work to identify exactly what the parameters of
each drop are and which configurations of those parameters correspond to
the needle crossing a line (again, this is done in one of the exercises
below).

In this chapter, we will first discuss Monte Carlo methods and their
usefulness under the assumption that we have available or can generate
the required random samples. As we will see in the exercises and their
solutions, such samples can often be obtained very easily using inbuilt R
functions, e.g. runif() and rnorm().


After this we will describe special methods for generating random
samples, starting with the simplest, such as the inversion technique and
rejection sampling. We reserve the more complicated techniques which
involve Markov chain theory to the next and later chapters.


Also, as part of the structure of the present chapter, we will first discuss 
Monte Carlo methods and random number generation in a fully general 
setting. Only after we have finished our treatment of these two topics (to 
a certain level at least) will we discuss their application to Bayesian 
inference. Hopefully this format will minimise any confusion. 


5.2 The method of Monte Carlo integration for 
estimating means 


One of the most important applications of Monte Carlo methods is the
estimation of means. Suppose we are interested in $\mu$, the mean of some
distribution defined by a density f(x) (or by a cumulative distribution
function F(x)), but we are unable to calculate $\mu$ exactly (or easily), for
example by applying the formula

$$\mu = Ex = \int x f(x) \, dx$$

(or $\mu = Ex = \sum_x x f(x)$ or $\mu = Ex = \int x \, dF(x)$).


Also suppose, however, that we are able to generate (or obtain) a random
sample from the distribution in question. Denote this sample as

$$x_1, \ldots, x_J \sim iid \ f(x)$$

(or $x_1, \ldots, x_J \sim iid \ F(x)$).


Then we may use this sample to estimate $\mu$ by the sample mean,

$$\bar{x} = \frac{1}{J} \sum_{j=1}^{J} x_j.$$

Also, a $1 - \alpha$ confidence interval (CI) for $\mu$ is given by

$$CI = \left( \bar{x} \pm z_{\alpha/2} \, s/\sqrt{J} \right),$$

where

$$s^2 = \frac{1}{J-1} \sum_{j=1}^{J} (x_j - \bar{x})^2$$

is the sample variance of the random values.


In this context we refer to: 


$x_1, \ldots, x_J$ as the Monte Carlo sample values
or the Monte Carlo sample

$\bar{x}$ as the Monte Carlo sample mean
or the Monte Carlo estimate

CI as the Monte Carlo $1 - \alpha$ confidence interval
for $\mu$

J as the Monte Carlo sample size

$s^2$ as the Monte Carlo sample variance
$s$ as the Monte Carlo sample standard deviation
$s/\sqrt{J}$ as the Monte Carlo standard error (SE).


Three important facts here are that:


• $\bar{x}$ is unbiased for $\mu$ (i.e. $E\bar{x} = \mu$)

• the CI has coverage approximately $1 - \alpha$, by the central limit
theorem

• the width of the CI converges to zero as the MC sample size
J tends to infinity.
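
The second of these facts can itself be checked by simulation. The following sketch repeatedly draws a Monte Carlo sample, forms the 95% CI, and records how often the interval contains the true mean. The choice of the Gamma(3,2) distribution (with mean 1.5), the sample size J = 100, the number of repetitions nrep = 2000 and the seed are all assumptions made for illustration.

```r
set.seed(123)                       # arbitrary seed
mu <- 3/2                           # true mean of the Gamma(3,2) distribution
J <- 100; nrep <- 2000; alpha <- 0.05

covered <- logical(nrep)
for(r in 1:nrep){
  xv <- rgamma(J, 3, 2)             # one Monte Carlo sample
  xbar <- mean(xv); s <- sd(xv)
  ci <- xbar + c(-1, 1)*qnorm(1 - alpha/2)*s/sqrt(J)
  covered[r] <- (ci[1] < mu) && (mu < ci[2])   # does the CI contain mu?
}
coverage <- mean(covered)           # proportion of CIs containing mu
coverage                            # should be roughly 0.95
```

Because the coverage result relies on the central limit theorem, the observed proportion may fall slightly short of 0.95 when J is small and the distribution is skewed.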


Exercise 5.1 Monte Carlo estimation of a known gamma mean 


(a) Use the R function rgamma() to generate a random sample of size
J = 100 from the Gamma(3,2) distribution, whose mean is $\mu = 3/2 = 1.5$.
Then use the method of Monte Carlo to produce a point estimate $\hat{\mu}$ and a
95% CI for $\mu$.


(b) Repeat (a) but with MC sample sizes of 1,000 and 10,000, and discuss
the results.

Note: In this exercise we are focusing on the integral

$$\mu = \int_0^{\infty} x f(x) \, dx = \int_0^{\infty} x \left( \frac{2^3 x^{3-1} e^{-2x}}{\Gamma(3)} \right) dx,$$

showing how it could be estimated via MC if it were not possible to
evaluate analytically. Exactly the same approach could be applied if the
integral were impossible to evaluate.


Solution to Exercise 5.1 


(a) Applying the above procedure (see the R code below) we estimate $\mu$
by $\bar{x}$ = 1.5170. The Monte Carlo 95% confidence interval for $\mu$ is

$$CI = \left( \bar{x} \pm z_{0.025} \, s/\sqrt{J} \right) = (1.3539, 1.6800).$$

We note that $\bar{x}$ is ‘close’ to the true value, $\mu$ = 1.5, and the CI contains
that true value.


(b) Repeating (a) with J = 1,000 we obtain the point estimate 1.5199 and
the interval estimate (1.4658, 1.5740).


Repeating (a) with J = 10,000 we obtain the point estimate 1.4942 and the
interval estimate (1.4773, 1.5110).


As in (a) we note in each case that $\bar{x}$ is ‘close’ to $\mu$, and the CI contains
$\mu$. We also note that as J increases the MC point estimate tends to get
closer to $\mu$, and the 95% CI tends to get narrower. (The widths of the
three CIs are 0.3261, 0.1081 and 0.0337.)


R Code for Exercise 5.1 


options(digits=4); J = 100; set.seed(221); xv=rgamma(J,3,2)
xbar=mean(xv); s=sd(xv); ci=xbar + c(-1,1)*qnorm(0.975)*s/sqrt(J) 
c(xbar,s,s^2,ci,ci[2]-ci[1]) # 1.5170 0.8320 0.6921 1.3539 1.6800 0.3261 


J = 1000; set.seed(231); xv=rgamma(J,3,2) 

xbar=mean(xv); s=sd(xv); ci=xbar + c(-1,1)*qnorm(0.975)*s/sqrt(J) 
c(xbar,s,s^2,ci,ci[2]-ci[1]) # 1.5199 0.8722 0.7607 1.4658 1.5740 0.1081 
J = 10000; set.seed(211); xv=rgamma(J,3,2) 

xbar=mean(xv); s=sd(xv); ci=xbar + c(-1,1)*qnorm(0.975)*s/sqrt(J) 
c(xbar,s,s^2,ci,ci[2]-ci[1]) # 1.4942 0.8597 0.7391 1.4773 1.5110 0.0337


5.3 Other uses of the MC sample


Once a Monte Carlo sample $x_1, \ldots, x_J \sim iid \ f(x)$ has been obtained, it can
be used for much more than just estimating the mean of the distribution,
$\mu = Ex$. For example, suppose we are interested in the (lower) p-quantile
of the distribution, namely

$$q_p = F^{-1}(p) = \{\text{value of } x \text{ such that } F(x) = p\}.$$

The MC estimate of $q_p$ is simply $\hat{q}_p$, the empirical p-quantile of $x_1, \ldots, x_J$.
For instance, the median $q_{1/2}$ can be estimated by the middle number
amongst $x_1, \ldots, x_J$ after sorting in increasing order. This assumes that J is
odd. If J is even, we estimate $q_{1/2}$ by the average of the two middle
numbers. Thus we may write the MC estimate of $q_{1/2}$ as

$$\hat{q}_{1/2} = \begin{cases} x_{((J+1)/2)}, & J \text{ odd} \\ \dfrac{x_{(J/2)} + x_{(J/2+1)}}{2}, & J \text{ even}, \end{cases}$$

where $x_{(k)}$ is the kth smallest value amongst $x_1, \ldots, x_J$ (k = 1,...,J).


Also, we estimate the $1 - \alpha$ central density region (CDR) for x, namely
$(q_{\alpha/2}, q_{1-\alpha/2})$, by $(\hat{q}_{\alpha/2}, \hat{q}_{1-\alpha/2})$.

Further, suppose we are interested in the expected value of some function 
of x, say y = g(x). That is, we wish to estimate the quantity/integral 

$$\psi = Ey = \int y f(y)\,dy = Eg(x) = \int g(x) f(x)\,dx.$$

Then we simply calculate $y_j = g(x_j)$ for each $j = 1,\ldots,J$. The result will 
be a random sample $y_1,\ldots,y_J \sim \text{iid } f(y)$ to which the method of Monte 
Carlo can then be applied in the usual way. Thus, an estimate of $\psi$ is 

$$\bar{y} = \frac{1}{J}\sum_{j=1}^{J} y_j$$

(the sample mean of the y-values), and a $1-\alpha$ CI for $\psi$ is 

$$\left(\bar{y} \pm z_{\alpha/2}\frac{s_y}{\sqrt{J}}\right),$$

where 

$$s_y^2 = \frac{1}{J-1}\sum_{j=1}^{J}(y_j - \bar{y})^2$$

(the sample variance of the y-values). 

This idea applies to even very complicated functions y = g(x) for which 
the exact or even approximate value of $\psi = Ey$ would otherwise be very 
difficult to obtain, either analytically or numerically using a deterministic 
technique such as numerical integration (or quadrature). 
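The estimate and confidence interval just described can be sketched in a few lines of R. The helper name `mc_mean`, the seed, and the choice g(x) = x^2 with x ~ G(3,2) are assumptions for illustration only; this choice is convenient because the true value, Var(x) + (Ex)^2 = 3/4 + 9/4 = 3, is known.

```r
# Hypothetical helper (not from the text): sample mean of y = g(x)
# together with a 1 - alpha MC confidence interval
mc_mean <- function(y, alpha = 0.05) {
  J <- length(y)
  ybar <- mean(y)
  half <- qnorm(1 - alpha / 2) * sd(y) / sqrt(J)
  c(est = ybar, lower = ybar - half, upper = ybar + half)
}

set.seed(1)                     # arbitrary seed (assumption)
xv <- rgamma(10000, 3, 2)       # x_1, ..., x_J ~ iid G(3,2)
res <- mc_mean(xv^2)            # y_j = g(x_j) = x_j^2; true E(x^2) is 3
res
```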


Also, the density f(x) can be estimated by smoothing a probability 
histogram of $x_1,\ldots,x_J$. Likewise, the density f(y) can be estimated by 
smoothing a probability histogram of $y_1,\ldots,y_J$. (This could be extremely 
useful if y is a very complicated function of x.) 


Note 1: As we will see later, it is often the case that we are able to sample 
from a distribution without knowing—or being able to derive—the 
exact form of its density function. 


Note 2: Smoothing a histogram requires some arbitrary decisions to be 
made about the degree of smoothing and other smoothing parameters. 
So the MC estimate of a density is not uniquely defined. 
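To make the point of Note 2 concrete, the sketch below smooths one and the same sample with three different bandwidths using the `adjust` argument of R's `density()` function. The sample, the seed and the particular `adjust` values are arbitrary assumptions; the only purpose is to show that different smoothing choices give visibly different density estimates.

```r
set.seed(1)                          # arbitrary seed (assumption)
xv <- rgamma(1000, 3, 2)             # illustrative MC sample
d1 <- density(xv)                    # default bandwidth
d2 <- density(xv, adjust = 0.3)      # less smoothing (wigglier estimate)
d3 <- density(xv, adjust = 3)        # more smoothing (flatter estimate)
hist(xv, prob = TRUE, breaks = 30, xlab = "x", main = "")
lines(d1, lwd = 2); lines(d2, lty = 2); lines(d3, lty = 3)
c(d2$bw, d1$bw, d3$bw)               # the three bandwidths actually used
```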


Exercise 5.2 Monte Carlo estimation of complicated quantities 


Suppose that x ~ G(3,2). Use MC methods and a sample of size J = 1,000 
to estimate: 

$\mu = Ex$, the 80% CDR for x, and f(x) 

$\psi = Ey$, the 80% CDR for y, and f(y), where $y = \dfrac{x^2 e^{-x}}{1 + x + 1/x}$. 


Present your results graphically, and wherever possible show the true 
values of the quantities being estimated. Then repeat everything but using 
a Monte Carlo sample size of J = 10,000. 


Solution to Exercise 5.2 


The required graphs are shown in Figures 5.1 to 5.4. See the R code below 
for more details. 
Figure 5.1 Histogram of x-values (J = 1,000) 

Figure 5.2 Histogram of y-values (J = 1,000) 

Figure 5.3 Histogram of x-values (J = 10,000) 

Figure 5.4 Histogram of y-values (J = 10,000) 

R Code for Exercise 5.2 

X11(w=8,h=4.5); par(mfrow=c(1,1)); options(digits=4) 

J = 1000; set.seed(221); xv=rgamma(J,3,2) 
xbar=mean(xv); xci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J) 
xcdr=quantile(xv,c(0.1,0.9)); xden=density(xv) 

yv=xv^2 * exp(-xv) / (1+ xv + 1/xv ) 
ybar=mean(yv); yci=ybar + c(-1,1)*qnorm(0.975)*sd(yv)/sqrt(J) 
ycdr=quantile(yv,c(0.1,0.9)); yden=density(yv) 

hist(xv, prob=T, breaks=seq(0,7,0.25),xlim=c(0,7),ylim=c(0,0.6),xlab="x", 
 main=""); lines(xden,lty=2,lwd=2) 
xvec=seq(0,10,0.01); lines(xvec,dgamma(xvec,3,2),lty=1,lwd=2) 
abline(v= c(xbar, xci, xcdr), lty=2, lwd=2) 
abline(v=c(3/2,qgamma(c(0.1,0.9),3,2)), lty=1,lwd=2) 
legend(4,0.6,c("MC estimates","True values"), lty=c(2,1),lwd=c(2,2)) 

hist(yv,prob=T, breaks=seq(0,0.2,0.005),xlim=c(0,0.2),ylim=c(0,30),xlab="y", 
 main=""); lines(yden,lty=2,lwd=2) 
abline(v= c(ybar, yci, ycdr), lty=2, lwd=2) 
legend(4,0.6,c("MC estimates","True values"), lty=c(2,1),lwd=c(2,2)) 

# Repeat with J = 10000 ------------------------------ 

J = 10000; set.seed(221); xv=rgamma(J,3,2) 
xbar=mean(xv); xci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J) 
xcdr=quantile(xv,c(0.1,0.9)); xden=density(xv) 

yv=xv^2 * exp(-xv) / (1+ xv + 1/xv ) 
ybar=mean(yv); yci=ybar + c(-1,1)*qnorm(0.975)*sd(yv)/sqrt(J) 
ycdr=quantile(yv,c(0.1,0.9)); yden=density(yv) 

hist(xv, prob=T, breaks=seq(0,9,0.25),xlim=c(0,7),ylim=c(0,0.6),xlab="x", 
 main=""); lines(xden,lty=2,lwd=2) 
xvec=seq(0,10,0.01); lines(xvec,dgamma(xvec,3,2),lty=1,lwd=2) 
abline(v= c(xbar, xci, xcdr), lty=2, lwd=2) 
abline(v=c(3/2,qgamma(c(0.1,0.9),3,2)), lty=1,lwd=2) 
legend(4,0.6,c("MC estimates","True values"), lty=c(2,1),lwd=c(2,2)) 

hist(yv,prob=T, breaks=seq(0,0.2,0.005),xlim=c(0,0.2),ylim=c(0,30),xlab="y", 
 main="") 
lines(yden, lty=2,lwd=2); abline(v= c(ybar, yci, ycdr), lty=2, lwd=2) 
legend(4,0.6,c("MC estimates","True values"), lty=c(2,1),lwd=c(2,2)) 


5.4 Importance sampling 


When applying the method of MC to estimate an integral of the form 

$$\psi = Eg(x) = \int g(x) f(x)\,dx,$$

suppose it is impossible (or difficult) to sample from f(x), but it is easy 
to sample from a distribution/density h(x) which is ‘similar’ to f(x). 
Then we may write 

$$\psi = \int \left[ g(x)\frac{f(x)}{h(x)} \right] h(x)\,dx = \int w(x) h(x)\,dx,$$

where 

$$w(x) = g(x)\frac{f(x)}{h(x)}.$$

This suggests that we sample $x_1,\ldots,x_J \sim \text{iid } h(x)$ and use MC to estimate 
$\psi$ by 

$$\hat{\psi} = \bar{w} = \frac{1}{J}\sum_{j=1}^{J} w_j,$$

where 

$$w_j = w(x_j).$$

This technique is called importance sampling, and there are several 
issues to consider. As already indicated, the method works best if h(x) is 
chosen to be very similar to f(x). 
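A minimal R sketch of this basic form of importance sampling follows. Purely for illustration it assumes that f(x) is the G(3,2) density (so that the true mean 3/2 is known), that g(x) = x, and that h(x) is the standard exponential density; none of these choices comes from the text. Here w(x) = 4x^3 exp(-x) is well behaved because h(x) has a heavier right tail than f(x).

```r
set.seed(1)                                  # arbitrary seed (assumption)
J <- 100000
xv <- rexp(J, 1)                             # x_1, ..., x_J ~ iid h(x) = exp(-x)
wv <- xv * dgamma(xv, 3, 2) / dexp(xv, 1)    # w_j = g(x_j) f(x_j) / h(x_j), g(x) = x
psi_hat <- mean(wv)                          # importance sampling estimate
ci <- psi_hat + c(-1, 1) * qnorm(0.975) * sd(wv) / sqrt(J)
c(psi_hat, ci)                               # compare with the true value 3/2
```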


Another issue is that f(x) may be known only up to a multiplicative 
constant, i.e. where f(x) = k(x)/c, where the kernel k(x) is known 
exactly but it is too difficult or impossible to evaluate the normalising 
constant $c = \int k(x)\,dx$. In that case, we may write 

$$\psi = \int g(x) f(x)\,dx = \frac{\int g(x) k(x)\,dx}{\int k(x)\,dx} = \frac{\int \left[ g(x)\dfrac{k(x)}{h(x)} \right] h(x)\,dx}{\int \left[ \dfrac{k(x)}{h(x)} \right] h(x)\,dx} = \frac{\int w(x) h(x)\,dx}{\int u(x) h(x)\,dx},$$

where: 

$$w(x) = g(x)\frac{k(x)}{h(x)}$$

$$u(x) = \frac{k(x)}{h(x)}.$$

This suggests that we sample $x_1,\ldots,x_J \sim \text{iid } h(x)$ (as before) and apply 
MC estimation to the means of w(x) and u(x), respectively (each with 
respect to the distribution defined by density h(x)) so as to obtain the 
estimate 

$$\hat{\psi} = \frac{\bar{w}}{\bar{u}} = \frac{\frac{1}{J}\sum_{j=1}^{J} w_j}{\frac{1}{J}\sum_{j=1}^{J} u_j} = \frac{w_1 + \ldots + w_J}{u_1 + \ldots + u_J},$$

where $w_j = w(x_j)$ and $u_j = u(x_j)$. 
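The ratio estimate above can be wrapped in a small R function, sketched here under stated assumptions: the helper name `is_ratio` and its arguments are hypothetical, and the test case uses the kernel k(x) = x^2 exp(-2x) of the G(3,2) density with h(x) = exp(-x), chosen only because the true mean 3/2 is known.

```r
# Hypothetical helper (not from the text): importance sampling when f(x)
# is known only through its kernel k(x); xv must be a sample from h(x)
is_ratio <- function(g, k, h, xv) {
  uv <- k(xv) / h(xv)       # u_j = k(x_j) / h(x_j)
  wv <- g(xv) * uv          # w_j = g(x_j) k(x_j) / h(x_j)
  sum(wv) / sum(uv)         # (w_1 + ... + w_J) / (u_1 + ... + u_J)
}

set.seed(1)                                   # arbitrary seed (assumption)
xv <- rexp(100000, 1)                         # draws from h(x) = exp(-x)
est <- is_ratio(g = function(x) x,
                k = function(x) x^2 * exp(-2 * x),   # kernel of G(3,2)
                h = function(x) exp(-x),
                xv = xv)
est                                           # compare with the true mean 3/2
```

The normalising constant never has to be computed, since it cancels in the ratio.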


Exercise 5.3 Example of Monte Carlo with importance sampling 


We wish to find $\mu = Ex$ where x has density 

$$f(x) \propto \frac{1}{x+1}e^{-x}, \quad x > 0.$$

Use Monte Carlo methods and importance sampling to estimate $\mu$. 

Solution to Exercise 5.3 

Here, $k(x) = \dfrac{1}{x+1}e^{-x}$, and it is convenient to use $h(x) = e^{-x}, x > 0$ 
(the standard exponential density, or Gamma(1,1) density). Then, 

$$\mu = Ex = \int x f(x)\,dx = \frac{\int x k(x)\,dx}{\int k(x)\,dx} = \frac{\int \dfrac{x}{x+1} h(x)\,dx}{\int \dfrac{1}{x+1} h(x)\,dx}.$$

So a MC estimate of $\mu$ is 

$$\hat{\mu} = \frac{\frac{1}{J}\sum_{j=1}^{J} \dfrac{x_j}{x_j+1}}{\frac{1}{J}\sum_{j=1}^{J} \dfrac{1}{x_j+1}},$$

where $x_1,\ldots,x_J \sim \text{iid } G(1,1)$. 

Implementing this with J = 100,000, we get 

$$\hat{\mu} = \frac{0.40345}{0.59655} = 0.67631.$$

Note 1: For interest we use numerical techniques to get the exact answer, 
$\mu = 0.67687$. 
Thus the relative error is about -0.084% (see the R code below). Figure 5.5 illustrates. 

Note 2: The exact value of the normalising constant is 
$c = \int k(x)\,dx = 0.596347$. 
From the above we see that our MC estimate of c is 0.59655 (similar). 


Figure 5.5 Illustration of importance sampling 
(Curves shown: f(x) = (1/c)*exp(-x)/(x+1), h(x) = exp(-x), x*f(x) and x*h(x), 
plotted as density against x. Points shown: E(x) = area under x*f(x), 
E(x) = area under x*h(x), and the MC estimate of E(x).) 


R Code for Exercise 5.3 


options(digits=10) 
kfun=function(x){ exp(-x)/(x+1) } 
c=integrate(f=kfun,lower=0,upper=Inf)$value; c # 0.5963473624 
ffun=function(x){ (1/0.5963473624)*exp(-x)/(x+1) } 
integrate(f=ffun,lower=0,upper=Inf)$value # 0.9999999999 
xffun=function(x){ x*(1/0.5963474)*exp(-x)/(x+1) } 
mu=integrate(f=xffun,lower=0,upper=Inf)$value; mu # 0.6768749849 

J=100000; set.seed(413); xv=rgamma(J,1,1) 
num=mean(xv/(xv+1)); den=mean(1/(xv+1)) 
est=num/den; c(num, den, est) # 0.4034510685 0.5965489315 0.6763084254 
err=100*(est-mu)/mu; err # -0.08370222467 (relative error in percent) 

plot(c(0,3),c(0,2),type="n",xlab="x",ylab="density"); xvec=seq(0,5,0.01) 
lines(xvec,dgamma(xvec,1,1),lty=1,lwd=3) 
lines(xvec,xvec*dgamma(xvec,1,1),lty=1,lwd=1) 
lines(xvec,ffun(xvec),lty=2,lwd=3); lines(xvec,xvec*ffun(xvec),lty=2,lwd=1) 
points(c(1,mu,est),c(0,0,0),pch=c(16,4,1),lwd=c(2,2,2),cex=c(1.2,1.2,1.2)) 
legend(1.7,2,c( "f(x) = (1/c)*exp(-x)/(x+1)", "h(x) = exp(-x)" ), 
 lty=c(2,1), lwd=c(3,3)) 
legend(1.7,1.3,c( "x*f(x)", "x*h(x)" ), lty=c(2,1), lwd=c(1,1)) 
legend(0.5,2,c("E(x) = area under x*f(x)", "E(x) = area under x*h(x)", 
 "MC estimate of E(x)"), pch=c(4,16,1),pt.lwd=c(2,2,2), pt.cex=c(1.2,1.2,1.2)) 


5.5 MC estimation involving two or more random variables 

All the examples so far have involved only a single random variable x. 
However, the method of Monte Carlo generalises easily to two or more 
random variables. In fact, the procedure for MC estimation of the mean of 
a function, as described above, is already valid in the case where x is a 
vector. We will now focus on the bivariate case, but the same principles 
apply when three or more random variables are being considered 
simultaneously. 

Suppose that we have a random sample from the bivariate distribution of 
two random variables x and y, denoted $(x_1,y_1),\ldots,(x_J,y_J) \sim \text{iid } f(x,y)$, 
and we are interested in some function of x and y, say r = g(x, y). Then 
we simply calculate $r_j = g(x_j, y_j)$ and perform MC inference on the 
resulting sample $r_1,\ldots,r_J \sim \text{iid } f(r)$. 

Note 1: This procedure applies whether or not the random variables x 
and y are independent. If they are independent then we simply sample 
$x_j \sim f(x)$ and $y_j \sim f(y)$. 

Note 2: If x and y are dependent, it may not be obvious how to generate 
$(x_j, y_j) \sim f(x,y)$. 
Then, one approach is to apply the method of composition, as detailed 
below. If that fails, other methods are available, in particular ones which 
involve Markov chain theory. Much more will be said on these methods 
later in the course. 


5.6 The method of composition 


Suppose we wish to sample a vector $(x_j, y_j) \sim f(x,y)$. Often this can be 
done in two different ways via the method of composition, as follows. 

One way is to first sample $x_j \sim f(x)$ and then sample $y_j \sim f(y|x_j)$. The 
result will be the desired $(x_j, y_j) \sim f(x,y)$. This follows by the identity 
(or ‘composition’) 

$$f(x,y) = f(x) f(y|x).$$

Note: Having obtained $(x_j, y_j) \sim f(x,y)$ in this manner, suppose we 
‘discard’ $x_j$. Then this will leave behind a single number, $y_j \sim f(y)$. 
This could be useful if all we really want is a sample from f(y) but 
sampling from this distribution/density directly is difficult. 

Alternatively, first sample $y_j \sim f(y)$ and then sample $x_j \sim f(x|y_j)$. 
The result will again be $(x_j, y_j) \sim f(x,y)$. This follows by the identity 

$$f(x,y) = f(y) f(x|y).$$

Note: Having obtained $(x_j, y_j) \sim f(x,y)$ in this second manner, 
suppose that we ‘discard’ $y_j$. This will leave behind a single number, 
$x_j \sim f(x)$. This could be useful if all we really desire is a sample from 
f(x) but sampling from this distribution/density directly is difficult. 

This idea of composition generalises easily to higher dimensions. For 
example, one of several different ways to sample a triplet 

$$(x_j, y_j, z_j) \sim f(x,y,z)$$

is to first sample $y_j \sim f(y)$, then sample $x_j \sim f(x|y_j)$ and finally sample 
$z_j \sim f(z|x_j,y_j)$. This works because of the identity 

$$f(x,y,z) = f(y) f(x|y) f(z|x,y).$$
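The following R sketch illustrates the method of composition and the remark about ‘discarding’ one component. The model is an assumption chosen for illustration (it is not taken from the text): x ~ G(3,2) and (y | x) ~ Poisson(x). For this particular pair the marginal distribution of y is known to be negative binomial with size 3 and probability 2/3, which allows the discarded-x sample to be checked against exact probabilities.

```r
set.seed(1)                      # arbitrary seed (assumption)
J <- 100000
xv <- rgamma(J, 3, 2)            # step 1: x_j ~ f(x)
yv <- rpois(J, xv)               # step 2: y_j ~ f(y | x_j), here Poisson(x_j)

# Discarding xv leaves yv as a sample from the marginal f(y)
emp <- as.numeric(table(factor(yv, levels = 0:4))) / J
rbind(empirical = emp, exact = dnbinom(0:4, size = 3, prob = 2/3))
```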
Exercise 5.4 

Suppose that we are interested in the distribution of a random variable 
defined by $r = y/(x + \sqrt{|y|})$, where x and y have a joint distribution 
defined by the pdf f(x,y) = f(x)f(y|x), and where x ~ G(3,2) and 
(y|x) ~ N(x,x). 

Use the R functions rgamma() and rnorm() to generate a sample of size 
J = 1,000 from the joint distribution of x and y. Then use the method of 
MC to estimate $\psi = Er$, and report a 95% CI for $\psi$. Also estimate the 
80% CDR for r and f(r). Present your results both graphically and 
numerically. 

Solution to Exercise 5.4 

Numerically, we estimate $\psi$ by 0.4256, and our 95% CI for $\psi$ is 
(0.4026, 0.4486). We also estimate the 80% CDR for r by (-0.1025, 
0.8339). The required graph is shown in Figure 5.6. 

Figure 5.6 Histogram of r-values (J = 1,000) 
R Code for Exercise 5.4 

X11(w=8,h=4.5); par(mfrow=c(1,1)); options(digits=4) 

# x ~ G(3,2), then (y|x) ~ N(x,x), i.e. mean x and standard deviation sqrt(x) 
J = 1000; set.seed(221); xv=rgamma(J,3,2); yv = rnorm(J,xv,sqrt(xv)) 
rv = yv/(xv+sqrt(abs(yv))) 

rbar=mean(rv); rci=rbar + c(-1,1)*qnorm(0.975)*sd(rv)/sqrt(J) 
rcdr=quantile(rv,c(0.1,0.9)); rden=density(rv) 

c(rbar,rci,rcdr) # 0.4256 0.4026 0.4486 -0.1025 0.8339 

hist(rv,prob=T, breaks=seq(-1,1.8,0.1),xlim=c(-1,1.6), ylim=c(0,1.3),xlab="r", 
 main=""); lines(rden,lty=1,lwd=2); abline(v= c(rbar, rci, rcdr), lty=2, lwd=2) 


5.7 Monte Carlo estimation of a binomial parameter 

Suppose we are interested in a binomial proportion (i.e. probability) p but 
have difficulty calculating this quantity exactly. Then we may interpret p 
as the mean $\mu$ of a Bernoulli distribution and directly apply the method 
of Monte Carlo in the usual way. In this special case, there are certain 
simplifications which result in slightly different-looking final formulae. 

Explicitly, suppose we are able to generate 

$$x_1,\ldots,x_J \sim \text{iid Bernoulli}(p).$$

Then the MC estimate of p is 

$$\bar{x} = \frac{1}{J}\sum_{j=1}^{J} x_j$$

(the sample proportion of 1s in the sample), and the MC sample variance is 

$$s^2 = \frac{1}{J-1}\sum_{j=1}^{J}(x_j - \bar{x})^2 = \frac{1}{J-1}\left(\sum_{j=1}^{J} x_j^2 - J\bar{x}^2\right)$$

$$= \frac{1}{J-1}\left(\sum_{j=1}^{J} x_j - J\bar{x}^2\right)$$ 

since $x_j^2 = x_j$ (because each $x_j$ is 0 or 1) 

$$= \frac{1}{J-1}\left(J\bar{x} - J\bar{x}^2\right) = \frac{J}{J-1}\bar{x}(1-\bar{x}).$$

It follows that a MC $1-\alpha$ CI for p is 

$$\left(\bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{J}}\right) = \left(\bar{x} \pm z_{\alpha/2}\sqrt{\frac{\bar{x}(1-\bar{x})}{J-1}}\right).$$

The MC estimate $\bar{x}$ is often written as $\hat{p}$, and J - 1 is often replaced by 
J (for simplicity). These changes lead to the standard form of the MC 
$1-\alpha$ confidence interval for p, 

$$\left(\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{J}}\right).$$


Note 1: The above theory is really nothing other than the usual classical 
theory for estimating a binomial proportion. Thus, there are many other 
CIs that could be substituted (e.g. the Wilson CI, whose coverage is 
closer to $1-\alpha$, and the Clopper-Pearson CI, whose coverage is always 
guaranteed to be at least $1-\alpha$ but which is typically wider). 
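For comparison, the three intervals mentioned in Note 1 can be computed in base R as sketched below. The counts (93 ones out of 100) are illustrative assumptions. The sketch relies on `prop.test()` with `correct = FALSE` returning the Wilson score interval and on `binom.test()` returning the Clopper-Pearson interval.

```r
x <- 93; n <- 100            # illustrative counts (assumption): 93 ones out of 100
phat <- x / n
# Standard (Wald) interval, as in the formula above
wald <- phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)
# Wilson score interval (no continuity correction)
wilson <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)
# Clopper-Pearson ('exact') interval
cp <- as.numeric(binom.test(x, n)$conf.int)
rbind(wald, wilson, cp)
```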


Note 2: The above MC inference depends on the $x_j$ values only by way 
of the sample mean $\bar{x}$ or, equivalently, by way of the sample total 
$x_T = x_1 + \ldots + x_J = J\bar{x}$. A consequence of this is that exactly the same 
Monte Carlo inference can be performed if we observe only a single 
value of the total $x_T$, whose distribution is given by $x_T \sim \text{Bin}(J, p)$. 


Note 3: A common application of the theory here is where the binomial 
parameter is the probability of some event involving random variables, 
for example $p = P(x > 1)$ and $p = P(x < y)$. 

For the first example here, we generate $x_j \sim f(x)$, let $r_j = I(x_j > 1)$, and 
then repeat independently many times so as to generate a random sample 
$r_1,\ldots,r_J \sim \text{iid Bern}(p)$. That sample can then be used for MC inference 
on $p = P(x > 1)$. 

The procedure for the second example is similar, except that it involves 
sampling $(x_j, y_j) \sim f(x,y)$ and determining $r_j = I(x_j < y_j)$, etc. 
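The second example in Note 3 can be sketched in R as follows. The helper name `mc_prop` and the choice of independent x ~ Exp(1) and y ~ Exp(2) are assumptions for illustration; this pair is convenient because the true value, P(x < y) = 1/(1 + 2) = 1/3, is known.

```r
# Hypothetical helper (not from the text): MC estimate of a probability
# from 0/1 indicators, with the standard 1 - alpha CI
mc_prop <- function(r, alpha = 0.05) {
  J <- length(r)
  phat <- mean(r)
  half <- qnorm(1 - alpha / 2) * sqrt(phat * (1 - phat) / J)
  c(est = phat, lower = phat - half, upper = phat + half)
}

set.seed(1)                           # arbitrary seed (assumption)
J <- 100000
xv <- rexp(J, 1); yv <- rexp(J, 2)    # independent x ~ Exp(1) and y ~ Exp(2)
res <- mc_prop(xv < yv)               # r_j = I(x_j < y_j); true p is 1/3
res
```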
Note 4: One use of MC CIs for a binomial proportion is to assess the 
coverage of MC CIs. 

Often, the true coverage probability of a MC CI is not exactly the 
nominal level, say 95%. This may be due to the MC sample size J being 
insufficiently large or for some other reason. 

If we are concerned about this, we may wish to estimate the true 
coverage of the MC CI by repeating the entire MC inference procedure 
itself a large number of times, say M. Each time we record an indicator 
r for the MC CI containing the quantity of interest. 

The result will be a sample $r_1,\ldots,r_M \sim \text{iid Bern}(p)$, where p is the true 
coverage probability, which can then be estimated via MC methods in 
the usual way. 


Exercise 5.5 Estimating a probability via Monte Carlo 


Use MC to estimate $p = P\left(\sqrt{\dfrac{x}{x+1}} > 0.3e^{x}\right)$, where x ~ Gamma(3,2). 

Solution to Exercise 5.5 

With J = 20,000, we sample $x_1,\ldots,x_J \sim \text{iid } G(3,2)$ and let 

$$r_j = I\left(\sqrt{\frac{x_j}{x_j+1}} > 0.3e^{x_j}\right).$$

Thereby we obtain an estimate of p equal to $\hat{p} = \dfrac{1}{J}\sum_{j=1}^{J} r_j = 0.2117$ 
and a 95% CI for p equal to 

$$\left(\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{20{,}000}}\right) = (0.2060, 0.2173).$$

Note 1: We may also view p as $p = P(y > 0.3)$, where $y = e^{-x}\sqrt{\dfrac{x}{x+1}}$ 
(for example). In that case, we sample $x_1,\ldots,x_J \sim \text{iid } G(3,2)$, calculate 
$y_j = e^{-x_j}\sqrt{\dfrac{x_j}{x_j+1}}$ and then let $r_j = I(y_j > 0.3)$. This leads to exactly 
the same results regarding p. As a by-product of this second approach, 
we obtain an estimate of the density function of the random variable 
$y = e^{-x}\sqrt{\dfrac{x}{x+1}}$, namely f(y), which would be very difficult to obtain 
analytically. Figure 5.7 illustrates. 

Note 2: The density() function in R used to smooth the histogram does 
not adequately capture the upper region of the density f(y), nor the 
fact that f(y) = 0 when y < 0. 


Figure 5.7 Histogram of 20,000 values of y 
R Code for Exercise 5.5 

X11(w=8,h=4.5); par(mfrow=c(1,1)); options(digits=4) 

J=20000; set.seed(162); xv=rgamma(J,3,2); ct=0 
yv= sqrt(xv)*exp(-xv) / sqrt(xv+1) 
for(j in 1:J) if(yv[j] > 0.3) ct=ct+1 
phat=ct/J; ci=phat+c(-1,1)*qnorm(0.975)*sqrt(phat*(1-phat)/J) 
c(phat,ci) # 0.2117 0.2060 0.2173 
hist(yv,prob=T,breaks=seq(0,0.5,0.005),xlim=c(0,0.4),xlab="y",main=" ") 
abline(v=0.3,lwd=3); lines(density(yv),lwd=3) 


Exercise 5.6 Buffon's needle problem 


A needle of length 10 cm is dropped randomly onto a floor with lines on 
it that are parallel and 10 cm apart. 


(a) Analytically derive p, the probability that the needle crosses a line. 


(b) Now forget that you know p. Estimate p using Monte Carlo methods 
on a computer and a sample size of 1,000. Also provide a 95% confidence 
interval for p. Then repeat with a sample size of 10,000 and discuss. 


Solution to Exercise 5.6 


(a) Let: 
X = perpendicular distance from centre of needle to nearest line, in units of 5 cm 
Y = acute angle between lines and needle, in radians 
C = ‘The needle crosses a line’. 

Then: 
$X \sim U(0,1)$ with density $f(x) = 1, 0 < x < 1$ 
$Y \sim U(0, \pi/2)$ with density $f(y) = 2/\pi, 0 < y < \pi/2$ 
$X \perp Y$ (i.e. X and Y are independent, so that 
$f(x,y) = f(x)f(y) = 1 \times \dfrac{2}{\pi} = \dfrac{2}{\pi}, 0 < x < 1, 0 < y < \pi/2$) 
$C = \{X < \sin Y\} = \{(x,y): x < \sin y\}$. 

Figure 5.8 illustrates this setup. 

It follows that 

$$p = P(C) = P(X < \sin Y) = \iint_{x < \sin y} f(x,y)\,dx\,dy = \int_{y=0}^{\pi/2}\left(\int_{x=0}^{\sin y}\frac{2}{\pi}\,dx\right)dy = \frac{2}{\pi}\int_{0}^{\pi/2}\sin y\,dy$$

$$= \frac{2}{\pi}\Big[-\cos y\Big]_{0}^{\pi/2} = \frac{2}{\pi}\left(-\cos\left(\frac{\pi}{2}\right) - (-\cos 0)\right) = \frac{2}{\pi}\big(-0 - (-1)\big) = \frac{2}{\pi} = 0.63662.$$

Figure 5.9 illustrates the integration here. 
Figure 5.8 Illustration of Buffon's needle problem 

Figure 5.9 Illustration of the solution to Buffon's needle problem 
(The figure plots the curve x = sin(y) and labels the regions C and Complement of C.) 
Note 1: Another way to express the above working is to first note that 

$$P(C|y) = P(C|Y = y) = P(X < \sin y\,|\,y) = P(X < \sin y) = \sin y,$$

since $(X|y) \sim X \sim U(0,1)$ with cdf $F(x|y) = F(x) = x, 0 < x < 1$. 

It follows that 

$$p = P(C) = E\,P(C|Y) = E\sin Y = \int_{0}^{\pi/2}(\sin y)\frac{2}{\pi}\,dy = \frac{2}{\pi},$$

as before. 

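Note 2: It can be shown that if the length of the needle is r times the 
distance between lines, then the probability that the needle will cross a 
line is given by the formula 

$$p = \begin{cases} \dfrac{2r}{\pi}, & r \le 1 \\ 1 - \dfrac{2}{\pi}\left\{\sqrt{r^2-1} - r + \sin^{-1}\left(\dfrac{1}{r}\right)\right\}, & r > 1. \end{cases}$$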


(b) For this part, we will make use of the analysis in (a) whereby 

$$C = \{(x,y): x < \sin y\},$$

and where: 

$$x \sim U(0,1), \quad y \sim U(0, \pi/2), \quad x \perp y.$$

Note: We suppose that these facts are understood but that the integration 
required to then proceed on from these facts to the final answer (as in 
(a)) is too difficult. 

We now sample $x_1,\ldots,x_J \sim \text{iid } U(0,1)$ and $y_1,\ldots,y_J \sim \text{iid } U(0, \pi/2)$ (all 
independently of one another). Next, we obtain the indicators defined by 

$$r_j = I(x_j < \sin y_j) = \begin{cases} 1 & \text{if } x_j < \sin y_j \\ 0 & \text{otherwise.} \end{cases}$$

The result is the MC sample $r_1,\ldots,r_J \sim \text{iid Bern}(p)$ (i.e. a sample of 
size J to be used for inference on p). (Equivalently, we may obtain 
$r_T = r_1 + \ldots + r_J \sim \text{Bin}(J, p)$, which will lead to the same final results.) 

Carrying out this experiment in R with J = 1,000 we get 
$\hat{p}$ = 0.618 and CI = (0.588, 0.648). 

Then repeating, but with J = 10,000 instead, we obtain 
$\hat{p}$ = 0.633 and CI = (0.624, 0.643). 

We see that increasing the MC sample size (from 1,000 to 10,000) has 
reduced the width of the MC CI from 0.060 to 0.019. Both intervals 
contain the true value, namely $2/\pi$ = 0.6366. 


R Code for Exercise 5.6 


# (a) 
X11(w=8,h=4.5); par(mfrow=c(1,1)) 
plot(seq(0,pi/2,0.01),sin(seq(0,pi/2,0.01)), type="l",lwd=3,xlab="y", ylab="x") 
abline(v=c(0,pi/2),lty=3); abline(h=c(0,1),lty=3) 
text(0.2,0.4,"x = sin(y)"); text(1,0.4,"C"); text(0.35,0.8," Complement of C") 
text(1.52,0.06,"pi/2") 

# (b) 
J=1000; set.seed(213); xv=runif(J,0,1); yv=runif(J,0,pi/2); rv=rep(0,J) 
options(digits=4); for(j in 1:J) if(xv[j]<sin(yv[j])) rv[j]=1 
phat=mean(rv); z=qnorm(0.975); pci=phat+c(-1,1)*z*sqrt(phat*(1-phat)/J) 
c(phat,pci,pci[2]-pci[1]) # 0.61800 0.58789 0.64811 0.06023 

J=10000; set.seed(215); xv=runif(J,0,1); yv=runif(J,0,pi/2); rv=rep(0,J) 
for(j in 1:J) if(xv[j]<sin(yv[j])) rv[j]=1 
phat=mean(rv); z=qnorm(0.975); pci=phat+c(-1,1)*z*sqrt(phat*(1-phat)/J) 
c(phat,pci,pci[2]-pci[1]) # 0.63320 0.62375 0.64265 0.01889 
Exercise 5.7 MC CIs for the coverage probabilities of MC CIs for a gamma mean 

(a) Using the R function rgamma(), generate a random sample of size 
J = 100 from the gamma distribution with parameters 3 and 2 and mean 
$\mu$ = 3/2. Then use the method of Monte Carlo to estimate $\mu$. In your 
estimation, include a 95% CI for $\mu$ and the width of this CI. Also report 
whether the CI contains the true value of $\mu$. 

(b) Repeat (a) but with J = 200, 500, 1,000, 10,000 and 100,000, 
respectively. Report the widths of the resulting CIs and, for each CI, state 
whether it contains $\mu$. Discuss any patterns that you see. 

(c) Repeat (a) M = 100 times and report the proportion of the resulting M 
95% MC CIs which contain the true value of the mean. (In each case use 
J = 100.) Hence calculate a 95% CI for p, the true coverage probability of 
the 95% MC CI for $\mu$ based on a MC sample of size J = 100 from the 
Gamma(3,2) distribution. 

(d) Repeat (c), but with M = 200, 500, 1,000 and 10,000, respectively. 
Discuss any patterns that you see. 


Solution to Exercise 5.7 


(a) Applying the procedure (see the R code below) we estimate $\mu$ by 
$\bar{x}$ = 1.517. The Monte Carlo 95% confidence interval for $\mu$ is 
CI = $(\bar{x} \pm z_{\alpha/2}\,s/\sqrt{J})$ = (1.354, 1.680). 

We observe that this interval has width 0.326 and contains $\mu$. 

(b) Repeating (a) as required, we obtain: 

$\bar{x}$ = 1.471 and CI = (1.348, 1.593) with width 0.245 for J = 200 
$\bar{x}$ = 1.430 and CI = (1.358, 1.502) with width 0.144 for J = 500 
$\bar{x}$ = 1.475 and CI = (1.419, 1.530) with width 0.111 for J = 1,000 
$\bar{x}$ = 1.490 and CI = (1.473, 1.508) with width 0.0344 for J = 10,000 
$\bar{x}$ = 1.502 and CI = (1.497, 1.507) with width 0.0107 for J = 100,000. 

We see that $\bar{x}$ appears to be converging towards $\mu$ = 1.5. The width of 
the CI appears to be decreasing as J increases. Each of these five CIs 
contains $\mu$, just like the CI in (a). 
(c) Repeating (a) M = 100 times leads to M = 100 MC CIs of which 93 
contain $\mu$ = 1.5. Thus $\hat{p}$ = 93%, which as expected is ‘close’ to the 95% 
nominal coverage of the CI. 

A 95% CI for p is 

$$\left(0.93 \pm 1.96\sqrt{\frac{0.93(1-0.93)}{100}}\right) = (0.880, 0.980).$$

This is consistent with the MC 95% CI for $\mu$ having coverage 95%. 

(d) Repeating (a) M = 200 times leads to $\hat{p}$ = 94.5% of the 200 CIs 
containing 1.5, with a 95% CI for p, 

$$\left(0.945 \pm 1.96\sqrt{\frac{0.945(1-0.945)}{200}}\right) = (0.913, 0.977).$$

Repeating (a) M = 500 times leads to $\hat{p}$ = 94.2% of the 500 CIs 
containing 1.5, with a 95% CI for p, 

$$\left(0.942 \pm 1.96\sqrt{\frac{0.942(1-0.942)}{500}}\right) = (0.922, 0.962).$$

Repeating (a) M = 1,000 times leads to $\hat{p}$ = 94.9% of the 1,000 CIs 
containing 1.5, with a 95% CI for p, 

$$\left(0.949 \pm 1.96\sqrt{\frac{0.949(1-0.949)}{1{,}000}}\right) = (0.935, 0.963).$$

Repeating (a) M = 10,000 times leads to $\hat{p}$ = 94.4% of the 10,000 CIs 
containing 1.5, with a 95% CI for p, 

$$\left(0.944 \pm 1.96\sqrt{\frac{0.944(1-0.944)}{10{,}000}}\right) = (0.940, 0.949).$$

The widths of all five CIs for p are: 0.100, 0.063, 0.041, 0.027 and 0.009. 
We see that the CI for p becomes narrower as M increases. Also, the 
proportion of CIs containing 1.5 converges towards 95% as M increases. 
The convergence does not seem to be uniform. This is because of Monte 
Carlo error. If we repeated the experiment again, we might find a slightly 
different pattern. 


Each of the CIs for p is consistent with p = 0.95, except the one with 
M = 10,000, which is the most reliable. In that case the CI for p is 
(0.940, 0.949), which is entirely below 0.95. This suggests that the true 
coverage probability of the 95% MC CI for $\mu$ is slightly less than 95%. 

The observed proportions appear to be converging to this limit rather than 
to 95% exactly. This is explainable by the fact that the MC sample size 
J = 100 is far from infinity. If we repeated (d) with a larger value of J in 
each case, say J = 1,000, we would see the proportion of the M CIs 
converge towards a limiting value which is even closer to 95%. But then 
an even larger value of M would be necessary to establish that there is in 
fact any difference between the limiting value and 95%. 


R Code for Exercise 5.7 


# (a) 
options(digits=5); J = 100; set.seed(221); xv=rgamma(J,3,2) 
xbar=mean(xv); ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J) 
c(xbar,ci) # 1.5170 1.3539 1.6800 

# (b) 
Jvec=c(100,200,500,1000,10000,100000); K = length(Jvec) 
xbarvec=rep(NA,K); LBvec=rep(NA,K); UBvec=rep(NA,K) 
set.seed(221) 
for(k in 1:K){ J=Jvec[k]; xv=rgamma(J,3,2); xbar=mean(xv) 
 ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J) 
 xbarvec[k]=xbar; LBvec[k]=ci[1]; UBvec[k]=ci[2] 
} 
Wvec=UBvec-LBvec 
print(rbind(Jvec, xbarvec, LBvec,UBvec, Wvec),digits=4) 
# Jvec    100.0000 200.0000 500.0000 1000.000 1.000e+04 1.000e+05 
# xbarvec   1.5170   1.4705   1.4299    1.475 1.490e+00 1.502e+00 
# LBvec     1.3539   1.3480   1.3577    1.419 1.473e+00 1.497e+00 
# UBvec     1.6800   1.5930   1.5020    1.530 1.508e+00 1.507e+00 
# Wvec      0.3261   0.2451   0.1443    0.111 3.441e-02 1.073e-02 

# (c) 
J=100; M=100; ct=0; set.seed(442); for(m in 1:M){ 
 xv=rgamma(J,3,2) 
 xbar=mean(xv); ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J) 
 if((ci[1]<=1.5)&&(1.5<=ci[2])) ct = ct + 1 } 
# The CI for p is based on the M indicators (here M = J = 100) 
p=ct/M; ci=p+c(-1,1)*qnorm(0.975)*sqrt(p*(1-p)/M) 
c(ct,p,ci) # 93.00000 0.93000 0.87999 0.98001 

# (d) 
J=100; Mvec=c(200,500,1000,10000); set.seed(651) 
for(M in Mvec){ ct=0 
 for(m in 1:M){ 
  xv=rgamma(J,3,2); xbar=mean(xv) 
  ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J) 
  if((ci[1]<=1.5)&&(1.5<=ci[2])) ct = ct + 1 
 } 
 p=ct/M; ci=p+c(-1,1)*qnorm(0.975)*sqrt(p*(1-p)/M) 
 print(c(M,p,ci,ci[2]-ci[1]),digits=3) } 
# [1] 200.0000 0.9450 0.9134 0.9766 0.0632 
# [1] 500.000 0.942 0.922 0.962 0.041 
# [1] 1.00e+03 9.49e-01 9.35e-01 9.63e-01 2.73e-02 
# [1] 1.00e+04 9.44e-01 9.40e-01 9.49e-01 9.00e-03 


5.8 Random number generation 


So far we have assumed the availability of the sample required for Monte 
Carlo estimation, such as $x_1,\ldots,x_J \sim \text{iid } f(x)$. The issue was skipped over 
by making use of ready-made functions in R such as runif(), rbeta() and 
rgamma(). However, many applications involve dealing with complicated 
distributions from which sampling is not straightforward. 

So we will next discuss some basic techniques that can be used to generate 
the required Monte Carlo sample from a given distribution. More 
advanced techniques will be treated later. We will first treat the discrete 
case, which is the simplest, and then the continuous case. It will be 
assumed throughout that we can at least sample easily from the standard 
uniform distribution, i.e. that we can readily generate u ~ U(0,1). 


Note: This sampling is easily achieved using the runif() function in R. 
Alternatively, it can be done physically by using a hat with 10 cards in 
it, where these have the numbers 0,1,2,....,9 written on them. Three cards 
(say) are drawn out of the hat, randomly and with replacement. The three 
numbers thereby selected are written down in a row, and a decimal point 
is placed in front of them. The resulting number (e.g. 0.472, 0.000 or 
0.970) is an approximate draw from the standard uniform distribution. 
Repeating the entire procedure several times results in a random sample 
from that distribution. Increasing ‘three’ above (to ‘five’, say) improves 
the approximation (e.g. yielding 0.47207, 0.00029 or 0.97010). 
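The physical procedure in this note can be mimicked in R, which gives a quick check that the resulting numbers behave roughly like standard uniform draws. The helper name `draw_unif`, the seed and the number of repetitions are assumptions for this sketch only.

```r
# Hypothetical helper (not from the text): draw 'ndigits' cards numbered
# 0 to 9 with replacement and place a decimal point in front of them
draw_unif <- function(ndigits = 3) {
  d <- sample(0:9, ndigits, replace = TRUE)
  sum(d / 10^(1:ndigits))        # e.g. digits 4, 7, 2 give 0.472
}

set.seed(1)                      # arbitrary seed (assumption)
uv <- replicate(10000, draw_unif(3))
c(mean(uv), var(uv))             # compare with 1/2 and 1/12 for U(0,1)
```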
5.9 Sampling from an arbitrary discrete distribution 

Suppose we wish to sample a value x ~ f(x) where f(x) is a discrete 
pdf defined over the possible values $x = x_1,\ldots,x_K$. First define 

$$f_k = f(x_k)$$

and 

$$F_k = f_1 + \ldots + f_k \quad (k = 1,\ldots,K),$$

noting that $F_K = 1$. 

Then sample u ~ U(0,1), and finally return: 

$x = x_1$ if $0 < u < F_1$ 
$x = x_2$ if $F_1 < u < F_2$ 
... 
$x = x_K$ if $F_{K-1} < u < F_K$ (= 1). 

One way to implement the above is to set k = 1, to repeatedly increment k 
by 1 until $F_{k-1} < u \le F_k$, and then, using the final value of k thereby 
obtained, to return $x = x_k$. 


Note 1: We see that this procedure will work also in the case where K is 
infinite. In that case a practical alternative is to redefine K as a value k 
for which $F_k$ is very close to 1 (e.g. 0.9999) and then approximate f(x) 
by zero for all $x > x_K$. 

Note 2: In R, an alternative to using u ~ U(0,1) is to apply the function 
sample() with appropriate specifications of $x_1,\ldots,x_K$ and $f_1,\ldots,f_K$ (as 
illustrated in an exercise below). 
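The incrementing scheme described above can be sketched in R as follows. The helper name `rdiscrete1`, the seed and the small three-point test distribution are assumptions for illustration; in practice the built-in sample() function mentioned in Note 2 would normally be preferred.

```r
# Hypothetical helper (not from the text): one draw from the discrete
# distribution with values xvals and probabilities fvals, using u ~ U(0,1)
rdiscrete1 <- function(xvals, fvals) {
  Fvals <- cumsum(fvals)                 # F_1, ..., F_K
  u <- runif(1)
  k <- 1
  # increment k until F_{k-1} < u <= F_k (the bound on k guards against
  # rounding error in F_K)
  while (k < length(xvals) && u > Fvals[k]) k <- k + 1
  xvals[k]
}

set.seed(1)                              # arbitrary seed (assumption)
samp <- replicate(20000, rdiscrete1(0:2, c(0.25, 0.5, 0.25)))
table(samp) / length(samp)               # compare with 0.25, 0.50, 0.25
```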


Exercise 5.8 Example of sampling from a simple discrete distribution 

Show that the above method works when applied to generating a value x 
from the Bin(2,1/2) distribution, i.e. that it returns x = 0, 1 and 2 with 
probabilities 1/4, 1/2 and 1/4, respectively. 

Solution to Exercise 5.8 

In this case, K = 3 and: 
$x_1 = 0$, $F(x_1) = P(x \le 0) = 0.25$ 
$x_2 = 1$, $F(x_2) = P(x \le 1) = 0.75$ 
$x_3 = 2$, $F(x_3) = P(x \le 2) = 1.00$. 

Let u ~ U(0,1). Then the method returns: 
$x = x_1 = 0$ if $0 < u < F(x_1)$, i.e. if 0.00 < u < 0.25 
$x = x_2 = 1$ if $F(x_1) < u < F(x_2)$, i.e. if 0.25 < u < 0.75 
$x = x_3 = 2$ if $F(x_2) < u < F(x_3)$, i.e. if 0.75 < u < 1.00. 

Thus, x has: 
0.25 - 0.00 = 0.25 probability of being set to 0 
0.75 - 0.25 = 0.50 probability of being set to 1 
1.00 - 0.75 = 0.25 probability of being set to 2 (all correct). 


Exercise 5.9 Sampling from a complicated discrete distribution 


Consider the discrete distribution defined by the pdf 

$$f(x) \propto \frac{x^3 e^{-x}}{1+\sqrt{x}}, \quad x = 1, 3, 5, \ldots.$$

Find the mean of the distribution by performing appropriate summations. 
Then generate a random sample from this distribution and use it to 
confirm the mean. 

Solution to Exercise 5.9 

Using R we calculate $k(x) = \dfrac{x^3 e^{-x}}{1+\sqrt{x}}$, x = 1, 3, 5,..., 41 (here k stands for 
kernel), noting that the last two values of k(x) are tiny (9.455201e-14 and 
1.454999e-14). 

We then calculate the sum of the kernel values, 
c = k(1) + k(3) +...+ k(41) = 1.051009, 
and thereby normalise the kernel to obtain 

$$f(x) = \frac{k(x)}{c}, \quad x = 1, 3, 5, \ldots, 41.$$

The pdf may also be written as $f(x) = k(x)/c$, $x = x_1,\ldots,x_K$, where: 
$x_k = 2k-1$; $k = 1,\ldots,K$; K = 21. The exact mean of the distribution is then 
evaluated numerically as 

$$\mu = \sum_{k=1}^{K} x_k f(x_k) = 3.6527.$$

Note: Changing 41 to 101 here changes the approximation to 3.6527, 
i.e. makes no difference to 4 decimals. This suggests that taking the 
upper bound as 41 is good enough. 

To sample J = 100,000 values from the distribution we may write 
sample(x=xvec,size=J,replace=TRUE,prob=fvec) 
where xvec is a vector with values 1,3,...,41 and fvec is a vector with the 
values f(1), f(3),..., f(41) (see the R Code below). 

Note: We could also change fvec to kvec here, where kvec is a vector 
with the values k(1), k(3),..., k(41); both possibilities will work since 
sample() will automatically normalise the values in its parameter ‘prob’. 

The Monte Carlo estimate of $\mu$ works out as 3.6494 with 95% CI 
(3.6374, 3.6615). We note that this CI contains the true value, 3.6527. 


R Code for Exercise 5.9 


kfun = function(x){ x^3*exp(-x)/(1 + sqrt(x)) }; options(digits=5) 
xvec=seq(1,41,2); kvec=kfun(xvec); c=sum(kvec); c # 1.051
fvec=kvec/c; sum(fvec) # 1
print(rbind(xvec,fvec)[,1:9],digits=3) 
# xvec 1.000 3.000 5.000 7.0000 9.0000 11.0000 13.00000 1.50e+01 1.70e+01 
# fvec 0.175 0.468 0.248 0.0816 0.0214 0.0049 0.00103 2.02e-04 3.78e-05 
sum(xvec*kvec)/sum(kvec) # 3.6527 
# Check that 41 is large enough: 
xvec=seq(1,101,2); kvec=kfun(xvec); sum(xvec*kvec)/sum(kvec) 
# 3.6527 (same) 
# Sample from the distribution 
xvec=seq(1,41,2); kvec=kfun(xvec); J=100000; set.seed(332); 
samp = sample(x=xvec,size=J,replace=TRUE, prob=fvec) 
est=mean(samp); std=sd(samp); ci=est+c(-1,1)*qnorm(0.975)*std/sqrt(J)
c(est,ci) # 3.6494 3.6374 3.6615


5.10 The inversion technique


Suppose we wish to sample x, a value of a continuous random variable X
with cdf $F_X(x)$. One way to do this is using the inversion technique,
defined as follows, with the underlying theorem and proof shown below.


First derive the quantile function of X, denoted $F_X^{-1}(p)$ ($0 < p < 1$).
(This can be done by setting $F_X(x)$ to p and solving for x.)


Next, generate a random number u from the standard uniform distribution.
(It will be assumed that this can be done easily, e.g. using runif() in R.)


Then return $x = F_X^{-1}(u)$ as a value sampled from the distribution of X.


Theorem 5.1: Suppose that X is a continuous random variable with cdf
$F_X(x)$ and quantile function $F_X^{-1}(p)$. Let U ~ U(0,1), independently of
X, and define $R = F_X^{-1}(U)$. Then R has the same distribution as X.


Proof of Theorem 5.1: Observe that U has cdf $F_U(u) = u$, $0 < u < 1$.
This implies that R has cdf

$$F_R(r) = P(R \le r) = P(F_X^{-1}(U) \le r) = P(F_X(F_X^{-1}(U)) \le F_X(r)) = P(U \le F_X(r)) = F_X(r).$$

Thus, R has the same cdf as X and therefore the same distribution.


Note: A complication with the inversion technique may arise if there is
difficulty deriving the quantile function $F_X^{-1}(p)$. In that case, since the
task is fundamentally to solve $F_X(x) = u$ for x, it may be useful to
apply the Newton-Raphson algorithm to the problem of solving the
equation $g(x) = 0$, where $g(x) = F_X(x) - u$.
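
As a rough sketch of inversion by numerical root finding, the code below solves $F_X(x) - u = 0$ using R's built-in `uniroot()` function. Note that `uniroot()` is a bracketing root finder, used here in place of the Newton-Raphson algorithm mentioned in the Note; the helper name `rinv`, the search interval (0, 100) and the tolerance are assumptions made purely for illustration. The standard exponential cdf is used so that the numerical answer can be checked against the closed-form quantile function $-\log(1-u)$.

```r
# Inversion by numerical root finding: solve F(x) - u = 0 for x.
# Illustration with the standard exponential cdf, F(x) = 1 - exp(-x).
Fx <- function(x) 1 - exp(-x)

# Hypothetical helper; the bracket (0, 100) is assumed wide enough
# to contain the root for any u that is not extremely close to 1.
rinv <- function(u) {
  uniroot(function(x) Fx(x) - u, lower = 0, upper = 100, tol = 1e-10)$root
}

u <- 0.371
c(numerical = rinv(u), exact = -log(1 - u))   # the two values should agree closely
```

A bracketing method of this kind avoids the need for a derivative and a starting value, at the cost of requiring an interval known to contain the solution.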


Exercise 5.10 Practice at the inversion technique 


(a) Using u = 0.371 as a value from the standard uniform distribution,
obtain a value from the standard exponential distribution. Then generate
a large random sample $U_1,\ldots,U_J$ ~ iid U(0,1) (of size J = 1,000, say) and
use this to create a random sample of the same size from the standard
exponential distribution. Check your results by calculating an estimate of
the mean of that distribution and also a 95% CI for that mean. Compare
your results with the true value of that mean, namely 1.

(b) Using u = 0.371 as a value from the standard uniform distribution,
obtain a value from the gamma distribution with mean and variance both
equal to 2. Then generate a large random sample $U_1,\ldots,U_J$ ~ iid U(0,1) (of
size J = 1,000, say) and use this to create a random sample of the same
size from the said gamma distribution. Check your results by calculating
an estimate of the mean of that distribution and also a 95% CI for that
mean. Compare your results with the true value, namely 2.


Solution to Exercise 5.10


(a) Let X ~ G(1,1) with density function $f(x) = e^{-x}$, $x > 0$, and cdf

$$F(x) = \int_0^x e^{-t}\,dt = 1 - e^{-x}, \quad x > 0.$$

The quantile function here is the solution of $1 - e^{-x} = p$, namely
$F^{-1}(p) = -\log(1-p)$.


So a value from the standard exponential distribution is easily computed
as $x = F^{-1}(u) = -\log(1-0.371) = 0.463624$.


Taking J = 1,000, we now generate $u_1,\ldots,u_J$ ~ iid U(0,1) in R using the
runif() function, and then calculate $x_j = -\log(1-u_j)$ for each j = 1,...,J.


This results in the required sample $x_1,\ldots,x_J$ ~ iid G(1,1). Using this
sample, the MC estimate of $\mu = EX$ is 0.9967, and a 95% CI for $\mu$ is
(0.9322, 1.0613). We see that the CI contains the true value being
estimated (i.e. 1).


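(b) Here, X ~ G(2,1) with mean 2/1 = 2, variance $2/1^2 = 2$, pdf $f(x) = x e^{-x}$, $x > 0$,
and cdf

$$F(x) = \int_0^x t e^{-t}\,dt = \left[-t e^{-t}\right]_{t=0}^{x} + \int_0^x e^{-t}\,dt$$

$$= -x e^{-x} + 0 + \left[-e^{-t}\right]_{t=0}^{x} = -x e^{-x} - e^{-x} + 1 = 1 - (x+1)e^{-x}.$$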


We see that the quantile function of X, $F_X^{-1}(p)$, does not have a closed
form expression, since it is the root of the function

$$g(x) = F(x) - p = 1 - (x+1)e^{-x} - p$$

(i.e. the solution of $g(x) = 0$).

However, for any p we can obtain that root using the Newton-Raphson
algorithm by iterating

$$x_{j+1} = x_j - \frac{g(x_j)}{g'(x_j)}, \quad \text{where } g'(x) = F'(x) - 0 = f(x) = x e^{-x},$$

$$\phantom{x_{j+1}} = x_j - \frac{1 - (x_j+1)e^{-x_j} - p}{x_j e^{-x_j}}.$$


With p = u = 0.371 and starting arbitrarily at $x_0 = 1$, we get the sequence:
1.0000, 1.2902, 1.2939, 1.2939, 1.2939, 1.2939, 1.2939, ...


So we return 1.2939 as a value from the G(2,1) distribution. 


As a check, we use the pgamma() function in R to confirm that
$F_X(1.2939) = 0.371$ as follows:
pgamma(1.2939,2,1) # 0.37101


Taking K = 1,000, we now generate $u_1,\ldots,u_K$ ~ iid U(0,1) in R using the
runif() function, and then for k = 1,...,K we solve

$1 - (x_k + 1)e^{-x_k} = u_k$ for $x_k$

using the NR algorithm each time. This procedure results in the sample
$x_1,\ldots,x_K$ ~ iid G(2,1).
Using this sample, an estimate of $\mu = EX$ is 1.9631, and a 95% CI for $\mu$
is (1.8815, 2.0446). We see that the CI contains the true value, 2.


R Code for Exercise 5.10
options(digits=5) 


# (a) 

-log(1-0.371) # 0.463624 

J=1000; set.seed(221); uv=runif(J,0,1) 

xv=-log(1-uv) # Generate a random sample of size 1000 from the G(1,1) dsn 
est=mean(xv); std=sd(xv); ci=est+c(-1,1)*qnorm(0.975)*std/sqrt(J)

c(est,ci) # 0.99673 0.93216 1.06130

# (b)

u=0.371; x=1; xv=x; for(j in 1:7){ x=x-(1-(x+1)*exp(-x)-u)/(x*exp(-x)); xv=c(xv,x) }
xv # 1.0000 1.2902 1.2939 1.2939 1.2939 1.2939 1.2939 1.2939

pgamma(x,2,1) # 0.371 Just checking that F(1.293860) = 0.371
pgamma(1.2939,2,1) # 0.37101


K=1000; xvec=rep(NA,K); set.seed(332); for(k in 1:K){
u=runif(1); x=1; for(j in 1:10) x=x-(1-(x+1)*exp(-x)-u)/(x*exp(-x))
xvec[k]=x } # Generate a random sample of size 1000 from the G(2,1) dsn
est=mean(xvec); std=sd(xvec)
ci=est+c(-1,1)*qnorm(0.975)*std/sqrt(K)
c(est,ci) # 1.9631 1.8815 2.0446


5.11 Random number generation via 
compositions 


Sometimes the most convenient way to sample from a distribution is to 
express it as a function (or composition) of two or more random variables 
which are easy to sample from. For example, to obtain two independent 
values from the standard normal distribution we may use the well-known 
Box-Muller algorithm, as follows. 


Sample $u_1, u_2$ ~ iid U(0,1) and let:
$z_1 = \sqrt{-2\log u_1}\,\cos(2\pi u_2)$
$z_2 = \sqrt{-2\log u_1}\,\sin(2\pi u_2)$.


It can be shown that $z_1, z_2$ ~ iid N(0,1). If we only need one value from
the standard normal distribution then we may arbitrarily discard $z_2$ and
return only $z_1$.
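
The Box-Muller transformation above might be coded as in the following sketch. The function name `boxmuller` and the choice of seed and sample size are illustrative assumptions, not part of the text.

```r
# Box-Muller: turn pairs of U(0,1) variates into pairs of N(0,1) variates.
boxmuller <- function(J) {
  u1 <- runif(J); u2 <- runif(J)
  r <- sqrt(-2 * log(u1))                       # common radial part
  cbind(z1 = r * cos(2 * pi * u2), z2 = r * sin(2 * pi * u2))
}

set.seed(1)                                     # arbitrary seed
z <- boxmuller(50000)
colMeans(z)                                     # both should be near 0
apply(z, 2, sd)                                 # both should be near 1
cor(z[, 1], z[, 2])                             # should be near 0
```

The sample means, standard deviations and correlation provide a simple informal check that the two columns behave like independent standard normal samples.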


Exercise 5.11 Sampling from the double exponential distribution


Suppose we wish to sample a value x ~ f(x), where
$f(x) = (1/2)e^{-|x|}$, $x \in \mathbb{R}$.


Describe how to obtain x as a composition of two other values that can
be easily sampled.




Solution to Exercise 5.11 


Let R and Y be independent random variables such that R ~ Bern(0.5)
and Y ~ G(1,1). Then U = (2R - 1)Y has the same distribution as X.


This is because R is equally likely to be 0 as it is to be 1, and so 2R - 1 is
equally likely to be -1 as it is to be +1. So there is a 50% chance that U
will be exponential (G(1,1)) and a 50% chance that U will be negative
exponential. So U has exactly the same distribution as X. For
a formal proof, see the Note below.


We see that a method for obtaining a value x ~ f(x) is to independently
sample r ~ Bern(0.5) and y ~ G(1,1), and then calculate x = (2r - 1)y.
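
A minimal R sketch of this composition is given below. The function name `rdexp` is hypothetical, and the checks rely on two standard facts about the double exponential distribution with this density: its mean is 0 and its variance is 2.

```r
# Double exponential sampling via the composition x = (2r - 1) * y
rdexp <- function(J) {
  r <- rbinom(J, 1, 0.5)     # r ~ Bern(0.5): chooses the sign
  y <- rexp(J, 1)            # y ~ G(1,1): standard exponential magnitude
  (2 * r - 1) * y
}

set.seed(1)                  # arbitrary seed
x <- rdexp(100000)
c(mean(x), var(x))           # theoretical values are 0 and 2
mean(x <= -1)                # theoretical value is (1/2)*exp(-1)
0.5 * exp(-1)
```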


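Note: The cdf of U = (2R - 1)Y is

$$F(u) = P(U \le u) = P((2R-1)Y \le u) = E\,P((2R-1)Y \le u \mid R)$$

$$= P(R=0)\,P((2R-1)Y \le u \mid R=0) + P(R=1)\,P((2R-1)Y \le u \mid R=1)$$

$$= \tfrac{1}{2}P(-Y \le u \mid R=0) + \tfrac{1}{2}P(Y \le u \mid R=1) = \tfrac{1}{2}P(Y \ge -u) + \tfrac{1}{2}P(Y \le u)$$

$$= \begin{cases} (1/2)e^{u} + (1/2)(0), & u < 0 \\ (1/2)(1) + (1/2)(1 - e^{-u}), & u \ge 0 \end{cases} \;=\; \begin{cases} (1/2)e^{u}, & u < 0 \\ 1 - (1/2)e^{-u}, & u \ge 0. \end{cases}$$

So U has pdf

$$f(u) = F'(u) = \begin{cases} (1/2)e^{u}, & u < 0 \\ 0 - (1/2)e^{-u}(-1), & u \ge 0. \end{cases}$$

That is, $f(u) = (1/2)e^{-|u|}$, $-\infty < u < \infty$, which is the same as the pdf of X.


Exercise 5.12 Sampling from a triangular distribution


Suppose we want to sample x ~ f(x) where

$$f(x) = \begin{cases} x, & 0 < x < 1 \\ 2 - x, & 1 < x < 2. \end{cases}$$

Describe how two random variables can be combined to obtain x.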


Solution to Exercise 5.12 


Sample the two random variables r ~ Bern(0.5) and y ~ Beta(2,1). Then
calculate x = ry + (1 - r)(2 - y). This way, there is a 50% chance that x
will equal y, whose pdf is f(y) = 2y, 0 < y < 1, and a 50% chance that x
will equal z = 2 - y, whose pdf is f(z) = 2(2 - z), 1 < z < 2.


A second solution is as follows. Sample $u_1, u_2$ ~ iid U(0,1) and calculate
$x = u_1 + u_2$. It can easily be shown that a value of x formed in this way has
the triangular pdf in question.
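
Both solutions can be tried out with a short R sketch such as the one below. The variable names are illustrative; the check uses the fact that this triangular distribution has mean 1 and variance 1/6 (the variance of the sum of two independent standard uniforms).

```r
J <- 100000; set.seed(1)                 # arbitrary sample size and seed

# First solution: mixture of y ~ Beta(2,1) and its reflection 2 - y
r <- rbinom(J, 1, 0.5)
y <- rbeta(J, 2, 1)
x1 <- r * y + (1 - r) * (2 - y)

# Second solution: sum of two independent standard uniforms
x2 <- runif(J) + runif(J)

rbind(first = c(mean(x1), var(x1)),
      second = c(mean(x2), var(x2)))     # theoretical values: 1 and 1/6
```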


5.12 Rejection sampling 


Some distributions are difficult to sample from using any of the already
mentioned methods. For example, when applying the inversion technique,
solving the equation F(x) = u may be problematic even with the aid of the
Newton-Raphson algorithm (e.g. due to instability unless starting very
close to the solution).


In such cases, one convenient and easy way to obtain a value from the 
distribution of interest may be via rejection sampling (also known as the 
rejection method or the acceptance-rejection method). This method works 
as follows. 


Suppose we want to generate a random number from a target distribution 
with density f(x). This target distribution may be continuous or discrete. 


We must first decide on a suitable envelope distribution with envelope
density h(x). (These are also called the majorising distribution and
majorising density.) Ideally, the chosen density h(x) is similar in shape
to f(x) and relatively easy to sample from.

We next define the following quantities:

$$c = \max_x \left\{ \frac{f(x)}{h(x)} \right\}, \qquad p(x) = \frac{f(x)}{c\,h(x)}.$$


The idea here is that f(x) lies entirely beneath ch(x) except that it 
touches ch(x) at maybe only one point. Then p(x), which is called the 


acceptance probability, appropriately lies between 0 and 1 (inclusive). 
Figure 5.10 illustrates this setup. The rejection algorithm is as follows: 


1. Sample a proposed value (or candidate) x' ~ h(x).
2. Calculate the acceptance probability $p = p(x') = \dfrac{f(x')}{c\,h(x')}$.
3. Generate a standard uniform value u ~ U(0,1).
4. Decide whether to accept or reject the candidate, as follows:
If u < p then accept x', meaning return x = x' and STOP.
If u > p then reject x', meaning go to Step 1 and REPEAT.


Steps 1 to 4 are repeated as many times as necessary until an acceptance
occurs, resulting in x = x'. The finally accepted value x is an observation
from f(x). Repeating the entire procedure above another J - 1 times
independently will result in a random sample of size J from f(x).
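
Steps 1 to 4 can be sketched in R as follows, using a Beta(4,8) target and a Beta(2,2) envelope as an illustration. The helper name `rrej`, the use of `optimize()` to find c numerically, and the bookkeeping of the number of candidates are all assumptions made for this sketch rather than part of the algorithm as stated.

```r
f <- function(x) dbeta(x, 4, 8)          # target density
h <- function(x) dbeta(x, 2, 2)          # envelope density

# c = max over x of f(x)/h(x), found numerically on (0,1)
cc <- optimize(function(x) f(x) / h(x), c(0.001, 0.999), maximum = TRUE)$objective
cc                                       # about 2.447

# Hypothetical helper: one accepted value plus the number of candidates used
rrej <- function() {
  ntry <- 0
  repeat {
    ntry <- ntry + 1
    xprop <- rbeta(1, 2, 2)              # Step 1: candidate from h
    p <- f(xprop) / (cc * h(xprop))      # Step 2: acceptance probability
    if (runif(1) < p) return(c(x = xprop, ntry = ntry))   # Steps 3 and 4
  }
}

set.seed(1)                              # arbitrary seed
out <- t(replicate(20000, rrej()))
mean(out[, "x"])                         # Beta(4,8) mean is 4/12
mean(out[, "ntry"])                      # average candidates per acceptance
```

The average number of candidates per accepted value gives an empirical feel for how much work the envelope choice costs.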


Figure 5.10 illustrates, with:

f(x) = density of the Beta(4,8) distribution
h(x) = density of the Beta(2,2) distribution
$c = \max_x \{ f(x)/h(x) \} = 2.45$
x' = 0.4 (example of a candidate)
$p = \dfrac{f(x')}{c\,h(x')} = \dfrac{2.365}{3.524} = 0.671$.


In this case, if we sample u = 0.419 (for example), then we accept x' and
return x = 0.4. If, however, we sample u = 0.705 (say), then we reject x'
and propose another x', etc.

Figure 5.10 Illustration of the rejection sampling algorithm

[Figure: the curves f(x), h(x) and c*h(x) plotted against x, with the points P, Q and R marked at x = 0.4. Annotation: probability of accepting 0.4 is p(0.4) = f(0.4)/(c*h(0.4)) = {distance P to Q} divided by {distance P to R} = 2.365/3.524 = 0.671.]


Note 1: The rejection sampling algorithm as defined here also works 
with f(x) and h(x) in the equations replaced by any kernels of the 
target and envelope distributions, respectively. 


Note 2: The overall acceptance rate is the unconditional probability of
acceptance and equals the area under f(x) divided by the area under
ch(x), which is 1/c (= 0.409 in our example).


The wastage may be defined as the overall probability of rejection,
namely 1 - 1/c, and this is simply the area between f(x) and ch(x)
divided by the area under ch(x) (= 0.591 in our example).


Note 3: If we consider the experiment of proposing values repeatedly 
until the next acceptance, then the number of candidates follows a 
geometric distribution with parameter 1/c, and so the expected number 
of candidates (until acceptance) is 1/(1/c) = c. 


Note 4: There are two basic principles which must be considered in
rejection sampling:


(i) The envelope density h(x) should be similar to the target density
f(x) since this will minimise wastage, i.e. minimise the average
number of proposals per acceptance, c, and hence optimise the computer
time required.

(ii) The envelope distribution should be easy to sample from.


Note 5: The idea of rejection sampling can be used to give an intuitively 
appealing account of how Bayes' theorem works. In this regard, see 


Smith and Gelfand (1992). 


Note 6: How rejection sampling works can most easily be explained by 
considering the case where f(x) defines a simple discrete distribution. 


This is the subject of the next exercise. 


R Code for Section 5.12 
X11(w=8,h=4.5); par(mfrow=c(1,1)) 


plot(c(0,1), c(0,6),type="n",xlab="x",ylab="")
xv=seq(0.001,0.999,0.01); hxv=dbeta(xv,2,2); lines(xv,hxv,lty=2,lwd=3)


kfun=function(x){ dbeta(x,4,8) }

# We could specify any positive function here (*)
k0=integrate(f=kfun,lower=0,upper=1)$value

# This calculates the normalising constant
fxv=kfun(xv)/k0; # This ensures f(x) as defined at (*) is a proper density


lines(xv,fxv, lty=1,lwd=3) 

c=max(fxv/hxv); c # 2.4472 

lines(xv,c*hxv,lty=3,lwd=3) 
legend(0,6,c("f(x)","h(x)","c*h(x)"),lty=c(1,2,3),lwd=c(3,3,3)) 
text(0.07,3,"c = 2.45") 


xval=0.4; lines(c(xval,xval),c(0, c*dbeta(xval,2,2)),lty=1,lwd=1)
points(rep(xval,3), c(0,kfun(xval)/k0,c*dbeta(xval,2,2)),
pch=rep(16,3), cex=rep(1.2,3))
text(0.43,0.05,"P"); text(0.43,2.5,"Q"); text(0.43,3.3,"R");
c(0,kfun(xval)/k0,c*dbeta(xval,2,2))
# 0.0000 2.3649 3.5239
2.3649/3.5239 # 0.6711
text(0.6,5.2,"Probability of accepting 0.4 is p(0.4) = f(0.4)/(c*h(0.4))\n= {distance P to Q} divided by {distance P to R}\n= 2.365/3.524 = 0.671")
c(0,kfun(xval)/k0,c*dbeta(xval,2,2)) # 0.0000 2.3649 3.5239


Exercise 5.13 Illustration of rejection sampling


Consider the Bin(2,1/2) distribution with pdf

$$f(x) = \begin{cases} 1/4, & x = 0, 2 \\ 1/2, & x = 1 \end{cases}$$

and suppose we want to sample from this using the rejection method with envelope
g(x) = 1/3, x = 0,1,2, i.e. the density of the discrete uniform distribution
over the integers 0, 1 and 2. Show that the rejection sampling algorithm
returns 0, 1 and 2 with the correct probabilities.


Solution to Exercise 5.13 


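Here:

$$c = \max_x \left\{ \frac{f(x)}{g(x)} \right\} = \frac{1/2}{1/3} = \frac{3}{2}, \qquad p(x) = \frac{f(x)}{c\,g(x)} = \begin{cases} 1/2, & x = 0, 2 \\ 1, & x = 1. \end{cases}$$

Now, suppose that we draw a very large number of proposed values
from g(x). Then: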

* about 1/3 of these will be 0, of which about 1/2 will be accepted 

* about 1/3 of these will be 1, of which (fully) all will be accepted 

* about 1/3 of these will be 2, of which about 1/2 will be accepted. 


We see that about 2/3 of all the proposed values will be accepted, and of
these about 25% will be 0, 50% will be 1, and 25% will be 2. About 1/3
of the candidates will be rejected, about half of these being 0 and half
being 2. The overall acceptance rate is 1/c = 1/(3/2) = 2/3, and the wastage
is 1 - 1/c = 1/3. On average, c = 1.5 candidates will have to be proposed
until an acceptance. Thus, generation of 1,000 Bin(2,1/2) values (say) will
require about 1,500 candidates.
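
The argument above can be checked empirically with a small simulation, sketched below. The variable names and the number of candidates (150,000) are arbitrary choices made for illustration.

```r
f <- c(0.25, 0.50, 0.25)                 # target probabilities for x = 0, 1, 2
g <- rep(1/3, 3)                         # discrete uniform envelope
cc <- max(f / g)                         # 1.5

J <- 150000; set.seed(1)                 # number of candidates; arbitrary seed
cand <- sample(0:2, J, replace = TRUE, prob = g)
p <- f[cand + 1] / (cc * g[cand + 1])    # acceptance probability of each candidate
acc <- runif(J) < p

mean(acc)                                # overall acceptance rate, near 2/3
table(cand[acc]) / sum(acc)              # accepted values: near 0.25, 0.50, 0.25
table(cand[!acc]) / sum(!acc)            # rejected values: about half 0, half 2
```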


5.13 Methods based on the rejection algorithm 


The rejection method may be used in conjunction with many other 
methods. For example, the Box-Muller algorithm (mentioned earlier) is a 
basis for the Marsaglia polar method for sampling from a normal 
distribution. This method involves generating 

$u_1, u_2$ ~ iid U(0,1)
repeatedly until

$s = (2u_1 - 1)^2 + (2u_2 - 1)^2 < 1$

and then returning $z_i = (2u_i - 1)\sqrt{-2(\log s)/s}$, i = 1, 2.

The result will (eventually) be the required sample
$z_1, z_2$ ~ iid N(0,1).


This algorithm includes a condition for rejecting the sample values $u_1, u_2$
and involves iterating until these values are accepted (as a pair). The
procedure may be less efficient than the Box-Muller algorithm (which 
does not involve rejection sampling and never requires more than two 
standard uniform variates) but avoids the computation of sines and cosines. 
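
A minimal R sketch of the polar method as described above follows. The function name `rpolar` is hypothetical, and the extra guard `s > 0` (to avoid taking the logarithm of zero) is an added precaution rather than part of the description in the text.

```r
# Marsaglia polar method: returns one pair (z1, z2) of N(0,1) values
rpolar <- function() {
  repeat {
    v <- 2 * runif(2) - 1                # (2*u1 - 1, 2*u2 - 1)
    s <- sum(v^2)
    if (s > 0 && s < 1)                  # accept the pair only inside the unit disc
      return(v * sqrt(-2 * log(s) / s))
  }
}

set.seed(1)                              # arbitrary seed
z <- t(replicate(20000, rpolar()))       # 20000 x 2 matrix of pairs
colMeans(z)                              # both should be near 0
apply(z, 2, sd)                          # both should be near 1
```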


5.14 Monte Carlo methods in Bayesian 
inference 


Most of the ideas above in this chapter are directly applicable to Bayesian
inference. Suppose we have derived a posterior distribution or density
$f(\theta \mid x)$ but it is complicated and difficult to work with directly. Then we
can try to generate a random sample from that posterior with a view to
estimating all the required inferential quantities (e.g. point and interval
estimates) via the method of Monte Carlo.


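First, denote the Monte Carlo sample as $\theta_1,\ldots,\theta_J$ ~ iid $f(\theta \mid x)$. Then, the
MC estimate of the posterior mean of $\theta$, namely

$$\hat\theta = E(\theta \mid x) = \int \theta f(\theta \mid x)\,d\theta,$$

is

$$\bar\theta = \frac{1}{J}\sum_{j=1}^{J}\theta_j \quad \text{(the MC sample mean)},$$

and a $1-\alpha$ CI for $\hat\theta$ is

$$\left(\bar\theta \pm z_{\alpha/2}\,\frac{s_\theta}{\sqrt{J}}\right),$$

where $s_\theta$ is the sample standard deviation of $\theta_1,\ldots,\theta_J$.


Also, a MC estimate of the $1-\alpha$ CPDR for $\theta$ is $(q_{\alpha/2}, q_{1-\alpha/2})$, where $q_p$
is the empirical p-quantile of $\theta_1,\ldots,\theta_J$, and the MC estimate of the
posterior median is $q_{1/2}$, etc.

Further, when the posterior density $f(\theta \mid x)$ does not have a closed form
expression (as is often the case), it can be estimated by smoothing a
probability histogram of $\theta_1,\ldots,\theta_J$.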


Once an estimate of the posterior density has been obtained, the mode of 
that estimate defines the MC estimate of the posterior mode. 


Suppose we are interested in some posterior probability

$$p = P(\theta \in A \mid x)$$

(where A is a subset of the parameter space).


Then, the MC estimate of p is

$$\hat p = \frac{1}{J}\sum_{j=1}^{J} I(\theta_j \in A),$$

i.e. the proportion of the $\theta_j$ values which lie in A, and a $1-\alpha$ CI for p is

$$\left(\hat p \pm z_{\alpha/2}\sqrt{\hat p(1-\hat p)/J}\right).$$
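
The estimates described so far in this section can be gathered into a small helper function, sketched below. The function `mcinf` and its arguments are hypothetical names introduced for illustration, the set A is restricted to an open interval for simplicity, and the Gamma(3,2) draws are merely a stand-in for a sample from some posterior distribution.

```r
# Hypothetical helper (not from the text): MC summaries from posterior draws
mcinf <- function(theta, A = NULL, alpha = 0.05) {
  J <- length(theta)
  z <- qnorm(1 - alpha / 2)                        # upper alpha/2 normal quantile
  est <- mean(theta)                               # MC estimate of the posterior mean
  out <- list(mean = est,
              ci = est + c(-1, 1) * z * sd(theta) / sqrt(J),
              cpdr = unname(quantile(theta, c(alpha / 2, 1 - alpha / 2))),
              median = median(theta))
  if (!is.null(A)) {                               # A = c(lower, upper), an interval
    phat <- mean(theta > A[1] & theta < A[2])      # proportion of draws in A
    out$p <- phat
    out$pci <- phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / J)
  }
  out
}

set.seed(1)                                        # arbitrary seed
theta <- rgamma(10000, 3, 2)                       # stand-in "posterior" sample
res <- mcinf(theta, A = c(1, 2))
res
```

In practice `theta` would be replaced by draws from the posterior of interest, obtained by any of the sampling methods in this chapter.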


Suppose we are interested in a function of the parameter, $\psi = g(\theta)$. Then
regardless of how complicated g is, we can perform MC inference on $\psi$
easily. Simply calculate $\psi_j = g(\theta_j)$ for each j = 1,...,J. This results in a
random sample from the posterior distribution of $\psi$, namely the values
$\psi_1,\ldots,\psi_J$ ~ iid $f(\psi \mid x)$.


One may then apply any of the ideas above, just as before. For example,
the posterior mean of $\psi$, namely

$$\hat\psi = E(\psi \mid x) = \int \psi f(\psi \mid x)\,d\psi = \int g(\theta) f(\theta \mid x)\,d\theta,$$

can be estimated by its MC estimate,

$$\bar\psi = \frac{1}{J}\sum_{j=1}^{J}\psi_j,$$

and a $1-\alpha$ CI for $\hat\psi$ is

$$\left(\bar\psi \pm z_{\alpha/2}\,\frac{s_\psi}{\sqrt{J}}\right),$$

where $s_\psi$ is the sample standard deviation of $\psi_1,\ldots,\psi_J$.


Exercise 5.14 MC inference under the normal-normal-gamma model


Recall the Bayesian model:
$(y_1,\ldots,y_n \mid \mu, \lambda)$ ~ iid $N(\mu, 1/\lambda)$
$f(\mu,\lambda) \propto 1/\lambda$, $\mu \in \mathbb{R}$, $\lambda > 0$.


Suppose we observe the data vector $y = (y_1,\ldots,y_n)$ = (2.1, 3.2, 5.2, 1.7).


(a) Generate J = 1,000 values from the posterior distribution of $\mu$. Use
this sample to perform MC inference on $\mu$. Illustrate your inferences with
a suitable graph.


(b) Generate J = 1,000 values from the posterior distribution of $\lambda$. Use
this sample to perform MC inference on $\lambda$. Illustrate your inferences with
a suitable graph.


(c) Use MC methods to estimate the signal to noise ratio (SNR), defined
as $\gamma = \mu/\sigma = \mu\sqrt{\lambda}$. Illustrate your inferences with a suitable graph.


Solution to Exercise 5.14 


(a) Recall that the marginal posterior distribution of $\mu$ is given by

$$\left( \frac{\mu - \bar y}{s/\sqrt{n}} \,\Big|\, y \right) \sim t(n-1).$$

So we generate $w_1,\ldots,w_J$ ~ iid t(n - 1) and then calculate

$$\mu_j = \bar y + \frac{s}{\sqrt{n}}\,w_j.$$

We then use the sample $\mu_1,\ldots,\mu_J$ ~ iid $f(\mu \mid y)$ for MC inference on $\mu$.
Thereby, we estimate $\mu$'s posterior mean $\hat\mu = E(\mu \mid y)$ by $\bar\mu$ = 3.077
with (3.001, 3.153) as the 95% MC CI for $\hat\mu$. The MC estimate of $\mu$'s
95% CPDR is (0.685, 5.507).


We now compare the above estimates with the true values:
$\hat\mu = \bar y = 3.050$
95% CPDR for $\mu$ = $\left(\bar y \pm t_{0.025}(n-1)\, s/\sqrt{n}\right)$ = (0.556, 5.544).

We observe that the true posterior mean is contained in the 95% MC CI
for that mean. Figure 5.11 provides a comparison of the above Monte
Carlo and 'exact' inferences.


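Note 1: The formula for the exact posterior density is

$$f(\mu \mid y) = f(w \mid y)\left|\frac{dw}{d\mu}\right| = f_{t(n-1)}\!\left(\frac{\mu - \bar y}{s/\sqrt{n}}\right) \times \frac{1}{s/\sqrt{n}}$$

$$= \frac{\Gamma(n/2)}{\sqrt{(n-1)\pi}\;\Gamma((n-1)/2)} \left(1 + \frac{1}{n-1}\left(\frac{\mu - \bar y}{s/\sqrt{n}}\right)^{2}\right)^{-n/2} \times \frac{\sqrt{n}}{s}, \quad \mu \in \mathbb{R},$$

where $w = (\mu - \bar y)/(s/\sqrt{n})$.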


Note 2: The MC sample $\mu_1,\ldots,\mu_J$ ~ iid $f(\mu \mid y)$ could also be obtained
using the following results:

$$(\lambda \mid y) \sim \text{Gamma}\!\left(\frac{n-1}{2}, \frac{n-1}{2}s^2\right), \qquad (\mu \mid y, \lambda) \sim N\!\left(\bar y, \frac{1}{n\lambda}\right).$$

Thus, using the method of composition and the identity

$$f(\mu, \lambda \mid y) = f(\lambda \mid y)\, f(\mu \mid y, \lambda),$$

we first sample

$$\lambda_j \sim \text{Gamma}\!\left(\frac{n-1}{2}, \frac{n-1}{2}s^2\right)$$

and then sample

$$\mu_j \sim N\!\left(\bar y, \frac{1}{n\lambda_j}\right) \quad \text{for each } j = 1,\ldots,J.$$

The result of this procedure is
$(\mu_1,\lambda_1),\ldots,(\mu_J,\lambda_J)$ ~ iid $f(\mu, \lambda \mid y)$,
and thereby
$\mu_1,\ldots,\mu_J$ ~ iid $f(\mu \mid y)$,
as before, after discarding all of the $\lambda_j$ values.

Figure 5.11 Monte Carlo inference on the normal mean

(b) One way to obtain a MC sample from the marginal posterior
distribution of $\lambda$ is as indicated in Note 2 of part (a). Alternatively, we
can make use of the result

$$(\lambda \mid y, \mu) \sim \text{Gamma}\!\left(\frac{n}{2}, \frac{n}{2}s_\mu^2\right), \quad \text{where } s_\mu^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mu)^2.$$

So, again by the method of composition, but this time using the identity

$$f(\mu, \lambda \mid y) = f(\mu \mid y)\, f(\lambda \mid y, \mu),$$

we make use of the sample already generated in (a) and sample

$$\lambda_j \sim \text{Gamma}\!\left(\frac{n}{2}, \frac{n}{2}s_{\mu_j}^2\right)$$

for each j = 1,...,J. The result is $(\mu_1,\lambda_1),\ldots,(\mu_J,\lambda_J)$ ~ iid $f(\mu, \lambda \mid y)$, and
thereby $\lambda_1,\ldots,\lambda_J$ ~ iid $f(\lambda \mid y)$ (after discarding all of the $\mu_j$ values).


Implementing this procedure (i.e. making use of the simulated values in
(a)) we obtain the required sample, $\lambda_1,\ldots,\lambda_J$ ~ iid $f(\lambda \mid y)$, and use it for
MC inference. Thereby we estimate $\lambda$'s posterior mean $\hat\lambda = E(\lambda \mid y)$ by
$\bar\lambda$ = 0.3998 with (0.3804, 0.4192) as the 95% MC CI for $\hat\lambda$. The MC
estimate of $\lambda$'s 95% CPDR is (0.0347, 1.2828).


We now compare the above estimates with the true values:

$\hat\lambda = 1/s^2 = 0.4071$

95% CPDR = $\left( F^{-1}_{G\left(\frac{n-1}{2}, \frac{n-1}{2}s^2\right)}(0.025),\; F^{-1}_{G\left(\frac{n-1}{2}, \frac{n-1}{2}s^2\right)}(0.975) \right)$ = (0.0293, 1.2684).


We see that the true posterior mean is contained in the 95% MC CI for
that mean. Figure 5.12 illustrates these Monte Carlo and 'exact' inferences.
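
For comparison, the alternative composition scheme described in Note 2 of part (a), which samples $\lambda$ first and then $\mu$ given $\lambda$, might be sketched as follows. The sample size J = 10,000 and the seed are arbitrary choices for this illustration and differ from those used in the solution, so the numerical results will not match the figures quoted above exactly.

```r
y <- c(2.1, 3.2, 5.2, 1.7); n <- length(y); ybar <- mean(y); s <- sd(y)
J <- 10000; set.seed(1)                            # arbitrary size and seed

lam <- rgamma(J, (n - 1) / 2, ((n - 1) / 2) * s^2) # lambda_j ~ f(lambda | y)
mu  <- rnorm(J, ybar, 1 / sqrt(n * lam))           # mu_j ~ f(mu | y, lambda_j)

# Marginal summaries; compare with the exact values given in the solution
c(mean(mu), quantile(mu, c(0.025, 0.975)))         # exact: 3.050, (0.556, 5.544)
c(mean(lam), quantile(lam, c(0.025, 0.975)))       # exact: 0.4071, (0.0293, 1.2684)
```

Because each pair is a draw from the joint posterior, either margin can be used on its own simply by ignoring the other component.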


Figure 5.12 Monte Carlo inference on the precision parameter

(c) Using the values sampled in (a) and (b), we now calculate $\gamma_j = \mu_j\sqrt{\lambda_j}$
for each j = 1,...,J, and hence obtain a MC sample $\gamma_1,\ldots,\gamma_J$ ~ iid $f(\gamma \mid y)$,
which can then be used to perform MC inference on $\gamma$. Implementing this strategy,
we estimate $\gamma$'s posterior mean by 1.800, with (1.745, 1.854) as a 95% CI
for that mean, and we estimate $\gamma$'s 95% CPDR as (0.228, 3.543).


Figure 5.13 illustrates these Monte Carlo estimates. Also shown are:
• the exact posterior mean of $\gamma$, which is $\hat\gamma = E(\gamma \mid y)$ = 1.793
• the exact 95% CPDR for $\gamma$, which is (0.0733, 3.5952)
• the exact posterior density of $\gamma$
• the MLE of $\gamma$, which is $\bar y/s$ = 3.05/1.567 = 1.946.

See the Note and R Code below for details of these calculations.


Figure 5.13 Monte Carlo inference on the signal to noise ratio

Note: The conditional posterior distribution of $\gamma = \mu\sqrt{\lambda}$ given $\lambda$ is

$$(\gamma \mid y, \lambda) \sim \sqrt{\lambda}\,(\mu \mid y, \lambda) \sim \sqrt{\lambda}\,N\!\left(\bar y, \frac{1}{n\lambda}\right) \sim N\!\left(\bar y\sqrt{\lambda}, \frac{1}{n}\right).$$

This follows from the uninformative normal-normal model, i.e. from the
fact that

$$(\mu \mid y, \lambda) \sim N\!\left(\bar y, \frac{1}{n\lambda}\right).$$

So the posterior density of $\gamma$ may be obtained numerically according to

$$f(\gamma \mid y) = E\{ f(\gamma \mid y, \lambda) \mid y \} = \int_0^\infty f(\gamma \mid y, \lambda)\, f(\lambda \mid y)\,d\lambda,$$

where:

$$f(\gamma \mid y, \lambda) = f_{N(\bar y\sqrt{\lambda},\,1/n)}(\gamma) = \sqrt{\frac{n}{2\pi}}\exp\!\left\{-\frac{n}{2}\left(\gamma - \bar y\sqrt{\lambda}\right)^2\right\}$$

$$f(\lambda \mid y) = f_{G\left(\frac{n-1}{2}, \frac{n-1}{2}s^2\right)}(\lambda) = \frac{\left(\frac{n-1}{2}s^2\right)^{\frac{n-1}{2}}}{\Gamma\!\left(\frac{n-1}{2}\right)}\,\lambda^{\frac{n-1}{2}-1}\exp\!\left\{-\frac{n-1}{2}s^2\lambda\right\}.$$

Also (as shown in a previous exercise), the posterior mean of $\gamma$ is
exactly

$$\hat\gamma = E(\mu\sqrt{\lambda} \mid y) = E\{E(\mu\sqrt{\lambda} \mid y, \lambda) \mid y\} = \bar y\,E(\sqrt{\lambda} \mid y) = \frac{\bar y}{s}\cdot\frac{\Gamma(n/2)}{\sqrt{(n-1)/2}\;\Gamma((n-1)/2)} = 1.793$$

(after some algebra).


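The exact 95% CPDR for $\gamma$ may be obtained by using the optim()
function to minimise

$$g(L,U) = \left( F_\gamma(U \mid y) - F_\gamma(L \mid y) - 0.95 \right)^2 + \left( f_\gamma(U \mid y) - f_\gamma(L \mid y) \right)^2$$

$$= \left( \int_L^U\!\!\int_0^\infty f(\gamma \mid y, \lambda) f(\lambda \mid y)\,d\lambda\,d\gamma - 0.95 \right)^2 + \left( \int_0^\infty f(\gamma = U \mid y, \lambda) f(\lambda \mid y)\,d\lambda - \int_0^\infty f(\gamma = L \mid y, \lambda) f(\lambda \mid y)\,d\lambda \right)^2,$$

with the result being (L, U) = (0.0733, 3.5952).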


R Code for Exercise 5.14 


# (a) 

y=c(2.1, 3.2, 5.2, 1.7); n=length(y); ybar=mean(y); s=sd(y); s # 1.567
J=1000; set.seed(144); options(digits=4)

wv=rt(J,n-1); muv=ybar+s*wv/sqrt(n)

mubar=mean(muv); muci=mubar + c(-1,1)*qnorm(0.975)*sd(muv)/sqrt(J) 
mucpdr=quantile(muv,c(0.025,0.975)) 

c(mubar,muci,mucpdr) # 3.0770 3.0012 3.1528 0.6848 5.5069 
muhat=ybar; mucpdrtrue= ybar+(s/sqrt(n))*qt(c(0.025,0.975),n-1) 
c(muhat,mucpdrtrue) # 3.050 0.556 5.544 


X11(w=8,h=5); par(mfrow=c(1,1)) 
hist(muv,prob=T,xlab="mu",xlim=c(-2,7.5), ylim=c(0,0.5),main="", 
breaks=seq(-20,20,0.25)) 


muvec=seq(-20,20,0.01);
postvec=dt( (muvec-ybar)/(s/sqrt(n)) , n-1 ) / (s/sqrt(n))
lines(muvec,postvec,lty=1,lwd=3)

lines(density(muv),lty=2,lwd=3)

abline(v=c(mubar,muci,mucpdr),lty=2,lwd=3)

abline(v=c(ybar,mucpdrtrue),lty=1,lwd=3)

legend(-2,0.5,c("Monte Carlo estimates","Exact posterior estimates"),
lty=c(2,1),lwd=c(3,3),bg="white")


# (b)
lamv=rep(NA,J); set.seed(332)
for(j in 1:J) lamv[j] = rgamma(1,n/2, (n/2)*mean((y-muv[j])^2))


lambar=mean(lamv); lamci=lambar + c(-1,1)*qnorm(0.975)*sd(lamv)/sqrt(J) 

lamcpdr=quantile(lamv,c(0.025,0.975)) 

c(lambar, lamci, lamcpdr) # 0.39980 0.38040 0.41920 0.03465 1.28283 

lamhat=1/s^2; lamcpdrtrue=qgamma(c(0.025,0.975),(n-1)/2,((n-1)/2)*s^2)

c(lamhat, lamcpdrtrue) # 0.40706 0.02928 1.26844

hist(lamv, prob=T,xlab="lam",xlim=c(0,2.5), ylim=c(0,2),main="",
breaks=seq(0,3,0.05))

lamvec=seq(0,3,0.01); lampostvec=dgamma(lamvec,(n-1)/2,((n-1)/2)*s^2)

lines(lamvec,lampostvec,lty=1,lwd=3)

lines(density(lamv),lty=2,lwd=3)

abline(v=c(lambar,lamci,lamcpdr),lty=2,lwd=3)

abline(v=c(1/s^2,lamcpdrtrue),lty=1,lwd=3)

legend(1.5,2,c("Monte Carlo estimates","Exact posterior estimates"),
lty=c(2,1),lwd=c(3,3),bg="white")


# (c)
gamv=muv*sqrt(lamv)


gambar=mean(gamv); gamci=gambar + c(-1,1)*qnorm(0.975)*sd(gamv)/sqrt(J) 
gamcpdr=quantile(gamv,c(0.025,0.975)) 

c(gambar, gamci, gamcpdr) # 1.7997 1.7453 1.8540 0.2284 3.5433 
mle=ybar/s; mle # 1.946 


gamhat=(ybar/s)*gamma(0.5+(n-1)/2)/(sqrt((n-1)/2)*gamma((n-1)/2)) 

print(c(ybar,s,gamhat),digits=8) # 3.0500000 1.5673757 1.7928178 

intfun=function(lam,gam,ybar=3.05,s=1.5673757,n=4){
dnorm(gam,ybar*sqrt(lam),1/sqrt(n))*dgamma(lam,(n-1)/2,s^2*(n-1)/2) }
integrate(function(gam) {
sapply(gam, function(gam) {
integrate(function(lam) {
sapply(lam, function(lam) intfun(lam,gam) )
}, 0, Inf)$value }) }, -Inf, Inf)
# 1 with absolute error < 4.7e-07 OK (Just checking) 


integrate(function(gam) {
sapply(gam, function(gam) {
integrate(function(lam) {
sapply(lam, function(lam) gam*intfun(lam,gam) )
}, 0, Inf)$value }) }, -Inf, Inf)
# 1.793 with absolute error < 4.7e-06 OK (Agrees with exact calculation)


gamvec=seq(-5,10,0.01); fgamvec=gamvec


for(i in 1:length(gamvec)){
fgamvec[i]=integrate( f=intfun, lower=0, upper=Inf,
gam=gamvec[i])$value }
plot(gamvec,fgamvec) # OK


L=-0.1; U=4.2 # Testing...
integrate(function(gam) {
sapply(gam, function(gam) {
integrate(function(lam) {
sapply(lam, function(lam) intfun(lam,gam) )
}, 0, Inf)$value }) }, L,U)
# 0.9823 with absolute error < 4.3e-08 OK


integrate( f=intfun, lower=0, upper=Inf, gam=U)$value -
integrate( f=intfun, lower=0, upper=Inf, gam=L)$value # -0.02074 OK


gfun=function(v){ L=v[1]; U=v[2]
( integrate(function(gam) {
sapply(gam, function(gam) {
integrate(function(lam) {
sapply(lam, function(lam) intfun(lam,gam) )
}, 0, Inf)$value }) }, L,U)$value - 0.95 )^2 +
( integrate( f=intfun, lower=0, upper=Inf, gam=U)$value -
integrate( f=intfun, lower=0, upper=Inf, gam=L)$value )^2 }


gfun(v=c(-0.1,4.2)) # 0.001473 OK
gfun(v=c(1,3)) # 0.08562 OK
res0=optim(par=c(0,4),fn=gfun)$par
res0 # 0.07334 3.59516
res1=optim(par=res0,fn=gfun)$par
res1 # 0.07332 3.59518
res2=optim(par=res1,fn=gfun)$par
res2 # 0.07332 3.59518 OK


L=res2[1]; U=res2[2] # Now check... 


integrate(function(gam) {
sapply(gam, function(gam) {
integrate(function(lam) {
sapply(lam, function(lam) intfun(lam,gam) )
}, 0, Inf)$value }) }, L,U)
# 0.95 with absolute error < 3.2e-07
integrate( f=intfun, lower=0, upper=Inf, gam=L)$value # 0.06598
integrate( f=intfun, lower=0, upper=Inf, gam=U)$value # 0.06598 All OK
hist(gamv,prob=T,xlab="gam",xlim=c(-1,6), ylim=c(0,0.6),main="",
breaks=seq(-2,7,0.1))
lines(density(gamv),lty=2,lwd=3)
abline(v=c(gambar,gamci,gamcpdr),lty=2,lwd=3)
points(mle,0,pch=4,lwd=3,cex=2)
lines(gamvec,fgamvec,lty=1,lwd=3)
abline(v=c(gamhat,L,U),lty=1,lwd=3)
legend(3,0.6,c("Monte Carlo estimates","Exact posterior estimates"),
lty=c(2,1),lwd=c(3,3),bg="white")
text(5,0.4,"The cross shows the MLE") 


5.15 MC predictive inference via the method 
of composition 


Suppose that in the context of a Bayesian model defined by $f(y \mid \theta)$ and
$f(\theta)$, we wish to predict a value x whose distribution is specified by
$f(x \mid y, \theta)$. Recall that the posterior predictive density is

$$f(x \mid y) = \int f(x \mid y, \theta)\, f(\theta \mid y)\,d\theta.$$

If this density is complicated, we may choose to perform MC predictive
inference on x using a sample $x_1,\ldots,x_J$ ~ iid $f(x \mid y)$. The question then
arises as to how such a sample may be obtained.

One answer is to sample from $f(x \mid y)$ directly. But that may be difficult
since $f(x \mid y)$ is complicated. Another answer is to apply the method of
composition through the equation

$$f(x, \theta \mid y) = f(x \mid y, \theta)\, f(\theta \mid y).$$

This means that we should first sample $\theta' \sim f(\theta \mid y)$ and then sample
$x' \sim f(x \mid y, \theta')$, the result being $(x', \theta') \sim f(x, \theta \mid y)$. If we then discard
$\theta'$, the result is the required $x' \sim f(x \mid y)$. Implementing this process a
total of J times results in the required sample, $x_1,\ldots,x_J$ ~ iid $f(x \mid y)$.


Exercise 5.15 Monte Carlo prediction in the binomial-beta 
model 


The probability of heads coming up on a bent coin follows a standard
uniform distribution a priori. We toss the coin 50 times and get 32 heads.
Estimate using Monte Carlo the probability that heads will come up on at
least six of the next 10 tosses of the same bent coin.


Solution to Exercise 5.15 


Recall the binomial-beta model:
$(y \mid \theta)$ ~ Bin(n, $\theta$)
$\theta$ ~ Beta($\alpha$, $\beta$),

for which the posterior distribution is given by
$(\theta \mid y)$ ~ Beta($\alpha + y$, $\beta + n - y$).


Earlier we showed that if the future data x has distribution defined by
$(x \mid y, \theta)$ ~ Bin(m, $\theta$),
then the posterior predictive distribution is given by

$$f(x \mid y) = \binom{m}{x}\frac{B(y + x + \alpha,\; n - y + m - x + \beta)}{B(y + \alpha,\; n - y + \beta)}, \quad x = 0,\ldots,m.$$


Rather than trying to sample from this distribution directly, we may do
the following:


Sample $\theta'$ ~ Beta($\alpha + y$, $\beta + n - y$)
Sample $x'$ ~ Bin(m, $\theta'$).


Discarding $\theta'$, we obtain the required sample value, $x' \sim f(x \mid y)$.
In the situation here: $\alpha = \beta = 1$, n = 50, y = 32, m = 10.

Implementing the above sampling strategy J = 10,000 times with these
specifications, we obtain a large MC sample, $x_1,\ldots,x_J \sim f(x \mid y)$.


It is found that 7,084 of the sample values are at least 6. So we estimate
$p = P(X \ge 6 \mid y)$ by $\hat p$ = 0.7084. A 95% CI for p is then

$$\left(\hat p \pm 1.96\sqrt{\hat p(1-\hat p)/J}\right) = (0.6995, 0.7173).$$

For interest, we also work out the probability exactly as
$p = \sum_{x=6}^{10} f(x \mid y)$ = 0.7030 (correct to 4 decimals)
and note that this value lies in the 95% CI obtained using MC methods.


R Code for Exercise 5.15 


options(digits=5) 

n=50; y=32; alp=1;bet=1; a=alp+y; b=bet+n-y; m=10; J=10000 
set.seed(443); tv=rbeta(J,a,b); xv=rbinom(J,m,tv) 
phat=length(xv[xv>=6])/J; 
ci=phat+c(-1,1)*qnorm(0.975)*sqrt(phat*(1-phat)/J) 

c(phat,ci) # 0.70840 0.69949 0.71731 


xvec=0:m; fxgiveny=
 choose(m,xvec)*beta(y+xvec+alp,n-y+m-xvec+bet)/beta(y+alp,n-y+bet)
sum(fxgiveny) #1 Just checking 
sum(fxgiveny[xvec>=6]) # 0.70296 


5.16 Rao-Blackwell methods for estimation 
and prediction 


Consider a Bayesian model with two parameters given by a specification
of $f(y \mid \theta, \psi)$ and $f(\theta, \psi)$, and suppose that we obtain a sample from
the joint posterior distribution of the two parameters, say
$$(\theta_1, \psi_1), \ldots, (\theta_J, \psi_J) \sim \text{iid } f(\theta, \psi \mid y).$$


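As we have seen, an unbiased Monte Carlo estimate of $\theta$'s posterior
mean, $\hat{\theta} = E(\theta \mid y)$, is $\bar{\theta} = (1/J) \sum_{j=1}^{J} \theta_j$, with an associated MC $1-\alpha$
CI for $\hat{\theta}$ given by $(\bar{\theta} \pm z_{\alpha/2}\, s_\theta / \sqrt{J})$, where $s_\theta$ is the sample standard
deviation of $\theta_1, \ldots, \theta_J$.

Now observe that
$$\hat{\theta} = E\{E(\theta \mid y, \psi) \mid y\} = \int E(\theta \mid y, \psi)\, f(\psi \mid y)\, d\psi.$$


This implies that another unbiased Monte Carlo estimate of $\hat{\theta}$ is
$$\bar{e} = \frac{1}{J} \sum_{j=1}^{J} e_j,$$
where
$$e_j = E(\theta \mid y, \psi_j),$$
and another $1-\alpha$ CI for $\hat{\theta}$ is
$$(\bar{e} \pm z_{\alpha/2}\, s_e / \sqrt{J}),$$
where $s_e$ is the sample standard deviation of $e_1, \ldots, e_J$.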


If possible, this second method of Monte Carlo inference is preferable to 
the first because it typically leads to a shorter CI. We call this second 
method Rao-Blackwell (RB) estimation. The first (original) method may 
be called direct Monte Carlo estimation or histogram estimation. 
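
The comparison between the two methods can be sketched with a toy example. The joint posterior used here is an assumption chosen only so that the inner expectation is known in closed form: $\psi \sim N(0,1)$ and $(\theta \mid \psi) \sim N(\psi, 1)$, giving $E(\theta \mid y, \psi) = \psi$ and $E(\theta \mid y) = 0$.

```r
# Toy comparison of direct MC and Rao-Blackwell (RB) estimation.
# Assumed joint posterior (illustrative only):
#   psi ~ N(0,1), (theta | psi) ~ N(psi,1), so E(theta | y, psi) = psi.
set.seed(1)
J <- 1000
psi   <- rnorm(J, 0, 1)
theta <- rnorm(J, psi, 1)
z <- qnorm(0.975)
# Direct estimate: average the sampled theta values.
direct <- mean(theta) + c(0, -1, 1) * z * sd(theta) / sqrt(J)
# RB estimate: average the conditional means e_j = E(theta | y, psi_j).
e  <- psi
rb <- mean(e) + c(0, -1, 1) * z * sd(e) / sqrt(J)
width_direct <- direct[3] - direct[2]
width_rb     <- rb[3] - rb[2]
rbind(direct = c(direct, width_direct), rb = c(rb, width_rb))
```

Here the sampled $\theta_j$ have variance 2 while the $e_j$ have variance 1, so the RB interval should be narrower by a factor of about $\sqrt{2}$.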


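The same idea extends to estimation of the entire marginal posterior
density of $\theta$, because this can be written
$$f(\theta \mid y) = \int f(\theta \mid y, \psi)\, f(\psi \mid y)\, d\psi = E_\psi\{f(\theta \mid y, \psi) \mid y\}.$$


Thus, the Rao-Blackwell estimate of $f(\theta \mid y)$ is
$$\bar{f}(\theta \mid y) = \frac{1}{J} \sum_{j=1}^{J} f(\theta \mid y, \psi_j),$$
as distinct from the ordinary histogram estimate obtained by smoothing a
probability histogram of $\theta_1, \ldots, \theta_J$.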


The idea further extends to predictive inference, where we are interested
in a future quantity $x$ defined by a specification of $f(x \mid y, \theta, \psi)$.


The direct MC estimate of the predictive mean, namely
$$\hat{x} = E(x \mid y),$$
is
$$\bar{x} = \frac{1}{J} \sum_{j=1}^{J} x_j,$$
where
$$x_1, \ldots, x_J \sim \text{iid } f(x \mid y)$$
(e.g. as obtained via the method of composition).

A superior estimate is the Rao-Blackwell estimate given by
$$\bar{E} = \frac{1}{J} \sum_{j=1}^{J} E_j,$$
where there is now a choice from the following:
$$E_j = E(x \mid y, \theta_j, \psi_j)$$
or $E_j = E(x \mid y, \psi_j)$
or $E_j = E(x \mid y, \theta_j)$.


This estimator ($\bar{E}$) is based on the identities
$$\hat{x} = E\{E(x \mid y, \theta, \psi) \mid y\} = E\{E(x \mid y, \psi) \mid y\} = E\{E(x \mid y, \theta) \mid y\}.$$


Note: The first of the three choices for $E_j$ is typically the easiest to
calculate but also leads to the least improvement over the ordinary
'histogram' predictor, $\bar{x} = (1/J) \sum_{j=1}^{J} x_j$.


Likewise, the Rao-Blackwell estimate of the entire posterior predictive
density $f(x \mid y)$ is
$$\bar{f}(x \mid y) = \frac{1}{J} \sum_{j=1}^{J} f_j(x),$$
where there is a choice from the following:
$$f_j(x) = f(x \mid y, \theta_j, \psi_j)$$
or $f_j(x) = f(x \mid y, \psi_j)$
or $f_j(x) = f(x \mid y, \theta_j)$.


Exercise 5.16 Practice at Rao-Blackwell estimation in the 
normal-normal-gamma model 


Recall the Bayesian model:
$$(y_1, \ldots, y_n \mid \mu, \lambda) \sim \text{iid } N(\mu, 1/\lambda)$$
$$f(\mu, \lambda) \propto 1/\lambda, \quad \mu \in \Re, \ \lambda > 0.$$
Suppose that we observe the vector $y = (y_1, \ldots, y_n) = (2.1, 3.2, 5.2, 1.7)$.
Generate J = 100 values from the joint posterior distribution of $\mu$ and $\lambda$
and use these values as follows. Calculate the direct Monte Carlo estimate
and the Rao-Blackwell estimate of $\lambda$'s marginal posterior mean.
In each case, report the associated 95% CI for that mean. Compare your
results with the true value of that mean. Produce a probability histogram
of the simulated $\lambda$-values. Overlay a smooth of this histogram and the
Rao-Blackwell estimate of $\lambda$'s marginal posterior density. Also overlay
the exact density.


Solution to Exercise 5.16 
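Recall from Equation (3.3) in Exercise 3.11 that:
$$(\mu \mid y, \lambda) \sim N\left(\bar{y}, \frac{1}{n\lambda}\right)$$
$$(\lambda \mid y) \sim \text{Gamma}\left(\frac{n-1}{2}, \frac{n-1}{2} s^2\right).$$


So we first sample
$$\lambda' \sim \text{Gamma}\left(\frac{n-1}{2}, \frac{n-1}{2} s^2\right)$$
and then we sample
$$\mu' \sim N\left(\bar{y}, \frac{1}{n\lambda'}\right).$$


The result is
$$(\mu', \lambda') \sim f(\mu, \lambda \mid y).$$


Repeating many times, we get
$$(\mu_1, \lambda_1), \ldots, (\mu_J, \lambda_J) \sim \text{iid } f(\mu, \lambda \mid y).$$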


The histogram estimate of $\hat{\lambda} = E(\lambda \mid y)$ works out as $\bar{\lambda} = 0.4396$, with 95%
CI (0.3767, 0.5026).


Next let $e_j = E(\lambda \mid y, \mu_j) = 1/s_{\mu_j}^2$, where $s_{\mu_j}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mu_j)^2$.


Then the Rao-Blackwell estimate of $\hat{\lambda}$ is $\bar{e} = 0.4150$, with associated 95%
CI (0.3892, 0.4408).


It will be observed that this second CI is narrower than the first (having
width 0.0517 compared with 0.1259). It will also be observed that both
CIs contain the true value, $\hat{\lambda} = 1/s^2 = 0.4071$.

Figure 5.14 shows:
• a probability histogram of $\lambda_1, \ldots, \lambda_J$
• a smooth of that probability histogram
• the true marginal posterior density, namely the $\text{Gamma}\left(\frac{n-1}{2}, \frac{n-1}{2} s^2\right)$ density of $(\lambda \mid y)$
• the Rao-Blackwell estimate of $f(\lambda \mid y)$ as given by
$$\bar{f}(\lambda \mid y) = \frac{1}{J} \sum_{j=1}^{J} f(\lambda \mid y, \mu_j),$$
where $f(\lambda \mid y, \mu_j)$ is the $\text{Gamma}\left(\frac{n}{2}, \frac{n}{2} s_{\mu_j}^2\right)$ density and $s_{\mu_j}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mu_j)^2$.


Note: The Rao-Blackwell estimate here is based on the result
$$(\lambda \mid y, \mu) \sim \text{Gamma}\left(\frac{n}{2}, \frac{n}{2} s_\mu^2\right), \quad \text{where } s_\mu^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mu)^2.$$


It will be observed that the Rao-Blackwell estimate of $\lambda$'s posterior
density is fairly close to the true density. The histogram estimate is much
less accurate and incorrectly suggests that $\lambda$ has some probability of being
negative.


Figure 5.14 Illustration of Rao-Blackwell estimation

R Code for Exercise 5.16
options(digits=4)

# (a)
y=c(2.1, 3.2, 5.2, 1.7); n=length(y); ybar=mean(y); s=sd(y); s2=s^2
J=100; set.seed(254); lamv=rgamma(J,(n-1)/2,s2*(n-1)/2);
muv=rnorm(J,ybar,1/sqrt(n*lamv)); est0=1/s^2
est1=mean(lamv); std1=sd(lamv); ci1=est1 + c(-1,1)*qnorm(0.975)*std1/sqrt(J)
ev=rep(NA,J); for(j in 1:J){ muval=muv[j]; ev[j]=1/mean((y-muval)^2) }
est2=mean(ev); std2=sd(ev); ci2=est2 + c(-1,1)*qnorm(0.975)*std2/sqrt(J)
rbind( c(est0,NA,NA,NA), c(est1,ci1,ci1[2]-ci1[1]), c(est2,ci2,ci2[2]-ci2[1]) )
# [1,] 0.4071 NA NA NA
# [2,] 0.4396 0.3767 0.5026 0.12589
# [3,] 0.4150 0.3892 0.4408 0.05166

# (b)
X11(w=8,h=5); par(mfrow=c(1,1))
hist(lamv,xlab="lambda",ylab="density",prob=T,xlim=c(0,2.5),
 ylim=c(0,2.5),main="",breaks=seq(0,4,0.05))
lines(density(lamv),lty=1,lwd=3)
lamvec=seq(0,3,0.01); RBvec=lamvec; smu2v=1/ev
for(k in 1:length(lamvec)){ lamval=lamvec[k]
 RBvec[k]=mean(dgamma(lamval,n/2,(n/2)*smu2v)) }
lines(lamvec,RBvec,lty=1,lwd=1)
lines(seq(0,3,0.005),dgamma(seq(0,3,0.005),(n-1)/2,s2*(n-1)/2), lty=3,lwd=3)
legend(1.2,2,c("Histogram estimate of posterior","Rao-Blackwell estimate",
 "True marginal posterior"), lty=c(1,1,3),lwd=c(3,1,3))


5.17 MC estimation of posterior predictive 
p-values 


Recall the theory of posterior predictive p-values whereby, in the context
of a Bayesian model specified by $f(y \mid \theta)$ and $f(\theta)$, we test $H_0$ versus
$H_1$ by choosing a suitable test statistic $T(y, \theta)$.


The posterior predictive p-value is then
$$p = P(T(x, \theta) \ge T(y, \theta) \mid y)$$
(or something similar, e.g. with $\ge$ replaced by $\le$), calculated under the
implicit assumption that $H_0$ is true.

If the calculation of $p$ is problematic, a suitable Monte Carlo strategy is as
follows:


1. Generate a random sample from the posterior,
$$\theta_1, \ldots, \theta_J \sim \text{iid } f(\theta \mid y).$$


2. Generate $x_j \sim f(y \mid \theta_j)$ independently for $j = 1, \ldots, J$
(so that $x_1, \ldots, x_J \sim \text{iid } f(x \mid y)$).


3. For each $j = 1, \ldots, J$ calculate $T_j = T(x_j, \theta_j)$ and $I_j = I(T_j \ge T)$,
where $T = T(y, \theta)$.


4. Estimate $p$ by $\hat{p} = \frac{1}{J} \sum_{j=1}^{J} I_j$, with associated $1-\alpha$ CI
$$\left(\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{J}}\right).$$

Exercise 5.17 Testing for independence in a sequence of 
Bernoulli trials 


A bent coin has some chance of coming up heads whenever it is tossed. 
Our uncertainty about that chance may be represented by the standard 
uniform distribution. 


The bent coin is tossed 10 times. Heads come up on the first seven tosses 
and tails come up on the last three tosses. 


Using Bayesian methods, test that the 10 tosses were independent. 
Solution to Exercise 5.17 


The observed number of runs (of heads or tails in a row) is 2, which seems 
rather small. 


Let $y_i$ be the indicator for heads on the $i$th toss ($i = 1, \ldots, n$; $n = 10$), and
let $\theta$ be the unknown probability of heads coming up on any single toss.


Also let $x_i$ be the indicator for heads coming up on the $i$th of the next $n$
tosses of the same coin, tossed independently each time.

Further, let $y = (y_1, \ldots, y_n)$ and $x = (x_1, \ldots, x_n)$, and choose the test statistic
as
$$T(y, \theta) = R(y),$$
defined as the number of runs in the vector $y$.


Then an appropriate posterior predictive p-value is
$$p = P(R(x) \le R(y) \mid y),$$
where $y = (1,1,1,1,1,1,1,0,0,0)$ and $R(y) = 2$.


Under the Bayesian model:
$$(y_1, \ldots, y_n \mid \theta) \sim \text{iid Bern}(\theta)$$
$$\theta \sim U(0, 1),$$
the posterior is given by
$$(\theta \mid y) \sim \text{Beta}(y_T + 1, n - y_T + 1),$$
where $y_T = y_1 + \ldots + y_n = 7$.


With J = 10,000, we now generate
$$\theta_1, \ldots, \theta_J \sim \text{iid Beta}(8, 4).$$


After that, we do the following for each $j = 1, \ldots, J$:


1. Sample $x_1^j, \ldots, x_n^j \sim \text{iid Bern}(\theta_j)$ and form the vector
$x^j = (x_1^j, \ldots, x_n^j)$.


2. Calculate $R_j = R(x^j)$ (i.e. calculate the number of runs in
$(x_1^j, \ldots, x_n^j)$).

3. Obtain $I_j = I(R_j \le R)$, where $R = R(y) = 2$.


Thereby we estimate $p$ by
$$\hat{p} = \frac{1}{J} \sum_{j=1}^{J} I_j = 0.0995,$$
with 95% CI
$$\left(\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{J}}\right) = (0.0936, 0.1054).$$

So the posterior predictive p-value is about 10 percent, which may be
considered as statistically non-significant. That is, there is insufficient
evidence (at the 5% level of significance, say) to conclude that the 10
tosses of the coin were somehow dependent.


Note 1: Using a suitable formula from runs theory, the exact value of $p$
could be obtained as
$$p = \int P(R(x) \le 2 \mid \theta)\, f_{Beta(8,4)}(\theta)\, d\theta$$
$$= \int \left\{ \sum_{x_T = 0}^{n} P(R(x) \le 2 \mid \theta, x_T)\, f(x_T \mid \theta) \right\} f_{Beta(8,4)}(\theta)\, d\theta,$$
where:

• $P(R(x) \le 2 \mid \theta)$ is the exact probability that 2 or fewer runs will
result on 10 Bernoulli trials if each has probability of success $\theta$

• $P(R(x) \le 2 \mid \theta, x_T)$ is the probability that 2 or fewer runs will result
when $x_T$ 1s and $n - x_T$ 0s are placed in a row

• $f(x_T \mid \theta) = \binom{n}{x_T} \theta^{x_T} (1 - \theta)^{n - x_T}$ is the binomial density with
parameters $n$ and $\theta$, evaluated at $x_T$.
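
For this particular statistic the exact value can also be sketched by direct enumeration rather than a general runs formula (this route is an assumption of the sketch, not the method referred to in Note 1). A 0/1 sequence of length $n$ has at most 2 runs only if it is all 1s, all 0s, $k$ 1s followed by $n-k$ 0s, or $k$ 0s followed by $n-k$ 1s ($k = 1, \ldots, n-1$). Each such sequence has probability $\theta^s (1-\theta)^{n-s}$, whose posterior expectation under Beta(a, b) is $B(a+s, b+n-s)/B(a, b)$. The function name `exact_p` is hypothetical.

```r
# Exact P(R(x) <= 2 | y) when theta | y ~ Beta(a, b), by enumerating
# the sequences with at most two runs.
exact_p <- function(n, a, b) {
  # Posterior expectation of theta^s * (1 - theta)^(n - s); vectorised in s.
  m <- function(s) exp(lbeta(a + s, b + n - s) - lbeta(a, b))
  k <- 1:(n - 1)
  # all 1s + all 0s + (k 1s then 0s) + (k 0s then 1s)
  m(n) + m(0) + sum(m(k)) + sum(m(n - k))
}
p10 <- exact_p(n = 10, a = 8, b = 4)    # posterior Beta(8,4), n = 10
p20 <- exact_p(n = 20, a = 15, b = 7)   # posterior Beta(15,7), n = 20
c(p10, p20)
```

By hand calculation these come out at roughly 0.099 and 0.008.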


Note 2: It is of interest to recalculate $p$ using data which seems even
more 'extreme', for example,
$$y = (1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0).$$


For this data, $R(y) = 2$ again but with $n = 20$ and $y_T = 14$. In this case,
$$(\theta \mid y) \sim \text{Beta}(y_T + 1, n - y_T + 1) = \text{Beta}(15, 7),$$
and we obtain the estimate $\hat{p} = 0.0088$ with 95% CI (0.0070, 0.0106).


Thus there is, as was to be expected, much stronger statistical evidence
to reject the null hypothesis of independence.

R Code for Exercise 5.17


R=function(v){m=length(v); sum(abs(v[-1]-v[-m]))+1}
# Calculates the number of runs in vector v
R(c(1,1,1,0,1)) # 3 testing...
R(c(1,1)) # 1
R(c(1,0,1,0,1)) # 5
R(c(0,0,1,1,1)) # 2
R(c(1,0,0,1,1,0,0,1,1,1,1,0)) # 6 ...all OK

n=10; J=10000; Iv=rep(0,J); set.seed(214); tv=rbeta(J,8,4)
for(j in 1:J){ xj=rbinom(n,1,tv[j]); if(R(xj)<=2) Iv[j]=1 }
p=mean(Iv); ci=p+c(-1,1)*qnorm(0.975)*sqrt(p*(1-p)/J)
c(p,ci) # 0.09950 0.09363 0.10537

n=20; J=10000; Iv=rep(0,J); set.seed(214); tv=rbeta(J,15,7)
for(j in 1:J){ xj=rbinom(n,1,tv[j]); if(R(xj)<=2) Iv[j]=1 }
p=mean(Iv); ci=p+c(-1,1)*qnorm(0.975)*sqrt(p*(1-p)/J)
c(p,ci) # 0.008800 0.006969 0.010631


6.1 Introduction


Monte Carlo methods were introduced in the last chapter. These included 
basic techniques for generating a random sample and methods for using 
such a sample to estimate quantities such as difficult integrals. This 
chapter will focus on advanced techniques for generating a random 
sample, in particular the class of techniques known as Markov chain 
Monte Carlo (MCMC) methods. Applying an MCMC method involves 
designing a suitable Markov chain, generating a large sample from that 
chain for a burn-in period until stochastic convergence, and making 
appropriate use of the values following that burn-in period. 


Like other iterative techniques such as the Newton-Raphson and 
Expectation-Maximisation algorithms, MCMC methods require an 
arbitrary starting point (or vector) and then involve iterating repeatedly 
until convergence. But MCMC methods are distinguished from these 
other methods by the fact that the update at each iteration is not 
deterministic but stochastic, with the probability distributions involved 
dependent on results from the previous iteration. 


Typically, MCMC methods are used to sample from multivariate 
probability distributions rather than univariate ones. This is because a 
univariate distribution can usually be sampled from using simpler 
methods. Nevertheless, we will begin our discussion of MCMC methods 
with a description of the Metropolis algorithm for sampling from 
univariate distributions, because that algorithm constitutes a basic 
building block for the more advanced methods. 


6.2 The Metropolis algorithm 


Suppose that we wish to sample from a univariate distribution with pdf
$f(x)$ for which rejection sampling and the other techniques described
previously are problematic (say). Then another way to proceed is via the
Metropolis algorithm. This is an example of a Markov chain Monte Carlo
(MCMC) method. The Metropolis algorithm may be described as
follows.

As with the Newton-Raphson algorithm, we begin by specifying an initial
value of $x$, call it $x_0$. We then also need to specify a suitable driver
distribution which is easy to sample from, defined by a pdf,
$$g(t \mid x).$$


For now, we will assume the driver to be symmetric, in the sense that
$$g(t \mid x) = g(x \mid t),$$
or more precisely,
$$g(t = a \mid x = b) = g(t = b \mid x = a) \quad \forall\, a, b \in \Re.$$


Note: The driver distribution may also be non-symmetric, but this case
will be discussed later.


We then do the following iteratively for each $j = 1, 2, 3, \ldots, K$ (where $K$ is
'large'):

(a) Generate a candidate value of $x$ by sampling $x_j' \sim g(t \mid x_{j-1})$. We
call $x_j'$ the proposed value and $g(t \mid x_{j-1})$ the proposal density.

(b) Calculate the acceptance probability as
$$p = \frac{f(x_j')}{f(x_{j-1})}.$$
Note: If $p > 1$ then we take $p = 1$. Also, if $x_j'$ is outside the range of
possible values for the random variable $x$, then $f(x_j') = 0$ and so $p = 0$.

(c) Accept the proposed value $x_j'$ with probability $p$.
To determine if $x_j'$ is accepted, generate $u \sim U(0, 1)$
(independently). If $u < p$ then accept $x_j'$, and otherwise reject $x_j'$.

(d) If $x_j'$ has been accepted then let $x_j = x_j'$, and otherwise let
$x_j = x_{j-1}$ (i.e. repeat the last value $x_{j-1}$ in the case of a rejection).


This procedure results in the realisation of a Markov chain,
$x_0, x_1, x_2, \ldots, x_K$. The last value of this chain, $x_K$, may be taken as an
observation from $f(x)$, at least approximately. The approximation will
be extremely good if $K$ is sufficiently large.
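
Steps (a) to (d) can be sketched as a generic function. This is a minimal illustration under stated assumptions: the driver is taken to be uniform on $(x - c, x + c)$, the target is supplied through the log of its kernel, and the names `metropolis` and `logk` are introduced here only for the sketch. The starting value must lie in the support of the target.

```r
# A generic Metropolis sampler with a symmetric uniform driver.
#   logk(x): log of the target kernel, returning -Inf outside the support
#   x0     : starting value (must satisfy logk(x0) > -Inf)
#   K      : number of iterations;  c: half-width of the uniform driver
metropolis <- function(logk, x0, K, c) {
  x <- numeric(K + 1); x[1] <- x0; acc <- 0
  for (j in 1:K) {
    prop <- runif(1, x[j] - c, x[j] + c)           # (a) proposed value
    p <- min(1, exp(logk(prop) - logk(x[j])))      # (b) acceptance probability
    if (runif(1) < p) {                            # (c) accept with probability p
      x[j + 1] <- prop; acc <- acc + 1             # (d) move to the proposal
    } else {
      x[j + 1] <- x[j]                             # (d) repeat the last value
    }
  }
  list(x = x, ar = acc / K)
}

# Usage: sample from the standard exponential distribution (mean 1).
set.seed(1)
out <- metropolis(function(x) if (x > 0) -x else -Inf, x0 = 1, K = 20000, c = 2)
est <- mean(out$x[-(1:1001)])   # discard x0 and the first 1,000 iterations
c(est, out$ar)
```

Working with the log of the kernel is a design choice that avoids numerical underflow when the density is very small.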


If we want a random sample of size $J$ from $f(x)$, then the whole
procedure can be repeated another $J - 1$ times, each time using either the
same starting value $x_0$ or a different one.


If $K$ is sufficiently large, stochastic convergence will be achieved within
$K$ iterations, regardless of the point(s) from which the algorithm is started.
Relabelling the last value, $x_K$, in the $j$th chain as $x_j$ ($j = 1, \ldots, J$) leads to
the required sample, namely $x_1, \ldots, x_J \sim \text{iid } f(x)$.


Generating a chain of length $K$ a large number of times $J$ may be considered
wasteful of computer resources. So typically only one long chain is
generated, of length $K = B + J$, where $B$ is sufficiently large for
stochastic convergence to be achieved from the single starting value, $x_0$,
and $J$ is again the required sample size. Discarding the results of the first
$B$ iterations (called the burn-in, including also $x_0$) and relabelling the last
$J$ values of the chain appropriately, the result will be the sample
$$x_1, \ldots, x_J \sim f(x).$$


A problem with this second method of generating the sample values is that
they will be autocorrelated to some extent, i.e. not a truly random (iid)
sample from the distribution $f(x)$. We will later discuss this issue and
how to deal with the problems that may arise from it. For the moment, we
stress that $x_1, \ldots, x_J$ will be approximately a random sample from $f(x)$.
Moreover, if $J$ is sufficiently large, then these values will be effectively
independent. This means that a probability histogram of these values will
in fact converge to $f(x)$ as $J$ tends to infinity.


Exercise 6.1 A simple application of the Metropolis algorithm 
Illustrate the Metropolis algorithm by generating a sample of size 400 


from the distribution defined by the density
$$f(x) = 6x^5, \quad 0 < x < 1.$$


Note: This is just the Beta(6,1) density and could be sampled from easily
in many other ways.

Solution to Exercise 6.1

Let us specify the driver distribution as the uniform distribution from
$x - c$ to $x + c$, where $c$ is a tuning parameter whose value is to be
determined (as discussed further below). Thus the driver density is
$$g(t \mid x) = \frac{1}{2c}, \quad x - c < t < x + c,$$
or equivalently
$$g(t \mid x) = \frac{1}{2c} I(|t - x| < c).$$


Note: This driver is symmetric, since
$$g(t = a \mid x = b) = \frac{1}{2c} I(|a - b| < c) = g(t = b \mid x = a) \quad \forall\, a, b \in \Re.$$


The $j$th iteration of the algorithm involves first sampling a candidate value
(or proposed value) from the driver distribution centred at the last value,
namely
$$x_j' \sim U(x_{j-1} - c, x_{j-1} + c),$$
and then accepting this candidate value with probability
$$p = \frac{f(x = x_j')}{f(x = x_{j-1})} = \frac{6 (x_j')^5}{6 x_{j-1}^5} = \left(\frac{x_j'}{x_{j-1}}\right)^5, \qquad (6.1)$$
where $p$ is taken to be:
0 in the case where $x_j' < 0$ or $x_j' > 1$
1 in the case where $x_{j-1} < x_j' < 1$.


Note: The cancellation of 6s in (6.1) illustrates an attractive feature of
the Metropolis algorithm generally: only the kernel of the sampling
density is needed. Here, the kernel of the sampling density $f(x) = 6x^5$
is $k(x) = x^5$. This fact can be very useful in more complicated situations
where only the kernel of the sampling density is known.


Starting from $x_0 = 0.1$ and with $c = 0.15$ (arbitrarily), we obtain a Markov
chain of length K = 500, with values as illustrated in Figure 6.1.


Some of the values of this chain are as follows:

$x_0, \ldots, x_{10}$ =
0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1861, 0.2650, 0.2650,
0.4065, 0.4388, 0.4388

$x_{301}, \ldots, x_{310}$ =
0.9261, 0.9987, 0.9987, 0.9987, 0.9987, 0.9725, 0.8889, 0.8889,
0.9672, 0.9315

$x_{491}, \ldots, x_{500}$ =
0.8058, 0.6811, 0.6073, 0.4587, 0.4353, 0.3462, 0.3462, 0.4177,
0.4177, 0.4656.


Note: There were four rejections until the first acceptance, at iteration
5, where $x_5 = x_5' = 0.1861$.


Figure 6.2 shows a probability histogram of the last J = 400 values, 
together with the exact density of x. It would appear that stochastic 
equilibrium has been achieved by about iteration 50. So we may, very 
conservatively, discard the first B = 100 iterations as the burn-in. 


The acceptance rate (AR) for this Markov chain is found to be 64%,
meaning that 320 of the 500 candidate values $x_j'$ were accepted and 36%
(or 180) were rejected.


Figure 6.1 Trace of sample values with tuning constant c = 0.15

Figure 6.2 Probability histogram with tuning constant c = 0.15


Changing the tuning parameter 


What happens if we make the tuning parameter c = 0.15 larger? Figures 
6.3 and 6.4 are a repeat of Figures 6.1 and 6.2, respectively, but using 
simulated values from a run of the Metropolis algorithm with c = 0.65. 


In this case the acceptance rate is only 20.8% and the histogram is a poorer
estimate of the true density (to which it would however converge as
$J \to \infty$). We say that the algorithm is now displaying poor mixing
compared to results in the first run of 500 where c = 0.15.


What happens if we make c = 0.15 smaller? Figures 6.5 and 6.6 are a 
repeat of Figures 6.1 and 6.2, respectively, but using simulated values 
from a run of the Metropolis algorithm with c = 0.05. 


In this case the acceptance rate is higher at 83%, there is greater
autocorrelation, and the histogram is again a poorer estimate of the true
density (to which it would however still converge as $J \to \infty$). We again
say that the algorithm is mixing poorly. 


It is important to stress that even if the algorithm is mixing poorly 
(whether this be due to the tuning constant being too large or too small), 
it will eventually (with a sufficiently large value of J) yield a sample that 
is useful for inference to the desired degree of precision. Tweaking the 
tuning constant is merely a device for optimising computational efficiency. 
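
The effect of the tuning constant can be sketched numerically by recording, for each value of c, the acceptance rate and the lag-1 autocorrelation of the chain. The function `met`, the seed, the chain length of 5,000 and the starting value 0.8 are arbitrary choices made for this sketch, so the numbers will differ from those reported for the figures.

```r
# Acceptance rate and lag-1 autocorrelation for several tuning constants,
# targeting f(x) = 6x^5 on (0,1) with a uniform driver of half-width cc.
met <- function(K, x, cc) {
  vec <- numeric(K); ct <- 0
  for (j in 1:K) {
    prop <- runif(1, x - cc, x + cc)
    p <- if (prop > 0 && prop < 1) (prop / x)^5 else 0
    if (runif(1) < p) { x <- prop; ct <- ct + 1 }
    vec[j] <- x
  }
  list(vec = vec, ar = ct / K)
}
set.seed(1)
tab <- t(sapply(c(0.05, 0.15, 0.65), function(cc) {
  res <- met(K = 5000, x = 0.8, cc = cc)
  c(c = cc, ar = res$ar, lag1 = acf(res$vec, plot = FALSE)$acf[2])
}))
tab   # one row per tuning constant
```

A very small c tends to give a high acceptance rate but small steps, while a very large c gives many rejections; both show up as strong autocorrelation.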
Figure 6.3 Trace of sample values with tuning constant c = 0.65

Figure 6.4 Probability histogram with tuning constant c = 0.65

Figure 6.5 Trace of sample values with tuning constant c = 0.05

Figure 6.6 Probability histogram with tuning constant c = 0.05

R Code for Exercise 6.1


MET <- function(K,x,c){
# This function performs the Metropolis algorithm for a simple model.
# Inputs: K = total number of iterations
#         x = initial value of x
#         c = tuning parameter.
# Outputs: $vec = vector of (K+1) values of x
#          $ar = acceptance rate
vec <- x; ct <- 0
for(j in 1:K){
  prop <- runif(1,x-c,x+c)
  p <- 0
  if((prop>0) && (prop<1)) p <- (prop/x)^5
  u <- runif(1)
  if(u < p){
    x <- prop
    ct <- ct + 1
  }
  vec <- c(vec,x)
}
ar <- ct/K
list(vec=vec,ar=ar)
}

K <- 500; X11(w=8,h=4.5); par(mfrow=c(1,1))

set.seed(316); res <- MET(K=K,x=0.1,c=0.15)
plot(0:K,res$vec,type="l",xlab="iteration",ylab="x",main="")
hist(res$vec[-(1:101)], prob=T,xlim=c(0.4,1),ylim=c(0,6),
 xlab="x",ylab="density",main="")
lines(seq(0.4,1,0.01),6*seq(0.4,1,0.01)^5); res$ar # 0.64

print(res$vec[1+c(0,1:10,301:310,491:500)], digits=4)
# [1] 0.1000 0.1000 0.1000 0.1000 0.1000 0.1861 0.2650 0.2650 0.4065 0.4388
# [11] 0.4388 0.9261 0.9987 0.9987 0.9987 0.9987 0.9725 0.8889 0.8889 0.9672
# [21] 0.9315 0.8058 0.6811 0.6073 0.4587 0.4353 0.3462 0.3462 0.4177 0.4177
# [31] 0.4656

set.seed(322); res <- MET(K=K,x=0.1,c=0.65)
plot(0:K,res$vec,type="l",xlab="iteration",ylab="x",main=" ")
hist(res$vec[-(1:101)], prob=T,xlim=c(0.4,1),ylim=c(0,6),xlab="x",
 ylab="density",main=" ")
lines(seq(0.4,1,0.01),6*seq(0.4,1,0.01)^5); res$ar # 0.208

set.seed(302); res <- MET(K=K,x=0.1,c=0.05)
plot(0:K,res$vec,type="l",xlab="iteration",ylab="x",main=" ")
hist(res$vec[-(1:101)], prob=T,xlim=c(0.4,1),ylim=c(0,6),xlab="x",
 ylab="density",main=" ")
lines(seq(0.4,1,0.01),6*seq(0.4,1,0.01)^5); res$ar # 0.83


Exercise 6.2 Sampling from a normal distribution via the
Metropolis algorithm


Use the Metropolis algorithm and a uniform driver to sample 10,000 
values from the standard normal distribution. 


Check your result by comparing the sample mean and sample standard 
deviation of your sample to the true theoretical values, 0 and 1. 


Calculate a Monte Carlo 95% confidence interval for the normal mean, 0.

Solution to Exercise 6.2


Since $f(x) \propto e^{-x^2/2}$, the acceptance probability at iteration $j$ is given by
$$p = \frac{f(x_j')}{f(x_{j-1})} = \exp\left\{-\frac{1}{2}\left((x_j')^2 - x_{j-1}^2\right)\right\}.$$


Using the same uniform driver as in Exercise 6.1, $x_0 = 5$ and $c = 2.5$
(where this tuning constant was chosen after some experimentation), we
obtain a Markov chain of length K = 10,500, as shown in Figure 6.7.


Figure 6.8 shows a histogram of the last J = 10,000 values, together with 
the standard normal density overlaid. 


We have very conservatively discarded the first B = 500 iterations as the
burn-in. The acceptance rate for this Markov chain is 54.8%.


The average of the J sampled values is 0.0355 (close to 0) and their sample
standard deviation is 1.0047 (close to 1). These values lead to a 95% CI
for the normal mean equal to (0.0158, 0.0552). We note that this CI does
not contain the true value, 0, as one might have expected it to. The
underlying issue behind this fact will be discussed generally in the next
section.

Figure 6.7 Trace of sample values

R Code for Exercise 6.2


MET <- function(K,x,c){
# This function performs the Metropolis algorithm to sample from the
# standard normal dsn.
# Inputs: K = total number of iterations
#         x = initial value of x
#         c = tuning parameter.
# Outputs: $vec = vector of (K+1) values of x
#          $ar = acceptance rate.
vec=x; ct=0
for(j in 1:K){ prop = runif(1,x-c,x+c)
 p = exp(-0.5*(prop^2-x^2)); u = runif(1)
 if(u <= p){ x = prop; ct=ct+1 }
 vec <- c(vec,x) }
ar=ct/K; list(vec=vec,ar=ar) }

B=500; J=10000; K=B+J
set.seed(117); res <- MET(K=K,x=5,c=2.5); res$ar # 0.548381
X11(w=8,h=4.5); par(mfrow=c(1,1))
plot(0:K,res$vec,type="l",xlab="iteration",ylab="x",main=" ")
hist(res$vec[-(1:(B+1))],prob=T,xlim=c(-4,4),ylim=c(0,0.5),xlab="x",
 ylab="density",nclass=50,main=" ")
lines(seq(-4,4,0.01),dnorm(seq(-4,4,0.01)),lwd=2)
est=mean(res$vec[-(1:(B+1))]); std=sd(res$vec[-(1:(B+1))])
ci=est+c(-1,1)*qnorm(0.975)*std/sqrt(10000)
c(est,std,ci) # 0.03550254 1.00470749 0.01581064 0.05519445


6.3 The batch means method 


As stated earlier, the output from the Metropolis algorithm leads to a
sample, $x_1, \ldots, x_J$, from the target density, $f(x)$, which exhibits some
degree of positive autocorrelation.


This does not present a major problem when one is interested in
calculating only point estimates. For example, if we wish to estimate the
distribution mean $EX = \int x f(x)\,dx$, each sample value $x_j$ has expected
value $EX$, and this is true regardless of how severely the simulated values
are correlated (assuming that all the simulated values are collected after
stochastic convergence). Therefore, the expected value of the Monte
Carlo mean is also exactly $EX$ (or very nearly so).


However, when one uses a severely and positively autocorrelated Monte
Carlo sample to calculate the standard $1-\alpha$ confidence interval for a
quantity such as $EX$, the true coverage probability of that interval may be
far less than the intended nominal value of $1-\alpha$.


One way of dealing with this problem is to generate J independent chains
and take the last value in each chain. Note that this was our original
formulation of the Metropolis algorithm (i.e. for sampling a single value).

Another option is to generate a single long chain, of length K = B + 10J
(say) and thin it out by recording only every 10th value in the chain after
burn-in. Even so, there will still be some autocorrelation remaining in the
J resulting values. The autocorrelation could be reduced further by
changing 10 to 100, say; but this would be at the cost of a 10-fold increase
in computer time needed.
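
Thinning can be sketched as follows. To keep the example self-contained, the chain is a stand-in autocorrelated AR(1) series generated with `arima.sim` (an assumption for illustration); in practice it would be the output of a Metropolis run. The values of B, J and the AR coefficient are arbitrary.

```r
# Thinning: keep every 10th value of a chain after burn-in.
set.seed(1)
B <- 100; J <- 1000; thin <- 10
K <- B + thin * J
chain <- as.numeric(arima.sim(list(ar = 0.9), n = K))  # stand-in for MCMC output
keep <- seq(B + thin, K, by = thin)     # indices B+10, B+20, ..., K
thinned <- chain[keep]
# Compare lag-1 autocorrelation before and after thinning.
r_full <- acf(chain[(B + 1):K], plot = FALSE)$acf[2]
r_thin <- acf(thinned, plot = FALSE)$acf[2]
c(length(thinned), r_full, r_thin)
```

The thinned series has J values and a smaller, though typically still positive, lag-1 autocorrelation.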


A more efficient solution to the autocorrelation problem is the batch
means method. We will now describe how this works for when we wish
to construct a $1-\alpha$ CI for $EX = \int x f(x)\,dx$ based on an autocorrelated
sample $x_1, \ldots, x_J \sim f(x)$.


The batch means CI will be different from the ordinary CI, namely
$(\bar{x} \pm z_{\alpha/2}\, s_x / \sqrt{J})$, where $\bar{x}$ and $s_x$ are the sample mean and sample
standard deviation of $x_1, \ldots, x_J$ (with $z_{\alpha/2} = 1.96$ for a 95% CI). The batch
means CI is obtained as follows.


First, break up the $J$ sample values into $m$ batches of size $n$ each, so that:
Batch 1 contains values $1, \ldots, n$ (the first $n$ values)
Batch 2 contains values $n + 1, \ldots, 2n$ (the next $n$ values)
...
Batch $m$ contains values $(m-1)n + 1, \ldots, J$ (the last $n$ values).


Next: Let $y_k$ be the mean of the $n$ $x_j$-values in the $k$th batch ($k = 1, \ldots, m$).
Let $s_y^2$ be the sample variance of $y_1, \ldots, y_m$.


Note: Thus $s_y^2 = \frac{1}{m-1} \sum_{k=1}^{m} (y_k - \bar{y})^2$, where $\bar{y} = \frac{1}{m} \sum_{k=1}^{m} y_k = \bar{x}$ is the
mean of the batch means and identical to the mean of all $J$ $x_j$-values.


Finally, compute the $1-\alpha$ batch means CI for $EX$ as $(\bar{x} \pm z_{\alpha/2}\, s_y / \sqrt{m})$.

Discussion


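The rationale for the batch means method is as follows. If the batch size $n$
is sufficiently large then, by the central limit theorem,
$$y_1, \ldots, y_m \sim \text{iid } N(\mu, \sigma^2 / n) \text{ (approximately)},$$
where $\mu = E(x_j)$ and $\sigma^2 = \text{Var}(x_j)$.


Consequently,
$$\bar{x} = \bar{y} \sim N\left(\mu, \frac{\sigma^2}{mn}\right) = N\left(\mu, \frac{\sigma^2}{J}\right),$$
since $J = mn$.


Therefore a $1-\alpha$ CI for $\mu$ is
$$(\bar{x} \pm z_{\alpha/2}\, \hat{\sigma} / \sqrt{J}),$$
where $\hat{\sigma}$ is an estimate of $\sigma$.

Now, an unbiased estimator of $\sigma^2 / n$ is $s_y^2$.

So an unbiased estimator of $\sigma^2$ is $n s_y^2$.


It follows that a $1-\alpha$ CI for $\mu$ is
$$(\bar{x} \pm z_{\alpha/2} \sqrt{n}\, s_y / \sqrt{J}) = (\bar{x} \pm z_{\alpha/2}\, s_y / \sqrt{m}).$$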
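
The batch means procedure described above can be sketched as a short function and compared with the ordinary CI. The function names `batch_means_ci` and `ordinary_ci` are introduced for this sketch, the sample length is assumed to equal $mn$ exactly, and the autocorrelated input is a stand-in AR(1) series rather than Metropolis output.

```r
# Batch means CI for the mean of an autocorrelated sample x (length m*n).
batch_means_ci <- function(x, m, level = 0.95) {
  J <- length(x); n <- J / m
  stopifnot(n == floor(n))                      # batches must be of equal size
  y <- colMeans(matrix(x, nrow = n, ncol = m))  # column k holds batch k
  z <- qnorm(1 - (1 - level) / 2)
  mean(x) + c(-1, 1) * z * sd(y) / sqrt(m)
}
# Ordinary CI, which treats the sample as iid.
ordinary_ci <- function(x, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  mean(x) + c(-1, 1) * z * sd(x) / sqrt(length(x))
}
set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.8), n = 1000))  # stand-in autocorrelated sample
ci_ord <- ordinary_ci(x)
ci_bm  <- batch_means_ci(x, m = 20)
rbind(ordinary = ci_ord, batch = ci_bm)
```

With positive autocorrelation the batch means interval is expected to be the wider of the two, reflecting the reduced information in the sample.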


Exercise 6.3 Testing the batch means method 


We wish to perform Monte Carlo estimation of the expected value of $X$
whose pdf is given by $f(x) \propto x^2$, $0 < x < 2$.


Note: Here, $X \sim 2\,\text{Beta}(3,1)$ and so $EX = 2 \times 3/(3+1) = 1.5$.


(a) Use the Metropolis algorithm to generate a sample of size J = 1,000
from X's distribution after a burn-in of 100.


Then use this sample to estimate EX, together with a 95% confidence
interval for EX. For this CI use the formula $(\bar{x} \pm 1.96\, s / \sqrt{J})$, where $s^2$
is the sample variance of the J sampled X-values. Also draw a histogram
of the J X-values overlaid with the exact pdf of X.

(b) Use the output from the Metropolis algorithm in (a) to construct
another 95% CI for EX, one using the batch means method, as follows:


Divide the J = 1,000 iterations into m = 20 consecutive batches,
each having n = 50 values of X.


Let $y_k$ be the average of the n X-values in the kth batch
(k = 1,...,m).


Let $s_y^2$ be the sample variance of the m batch means
$y_1, \ldots, y_m$.


Let the confidence interval for EX be $(\bar{x} \pm 1.96\, s_y / \sqrt{m})$.


(c) Conduct a Monte Carlo experiment to assess the quality of the two CIs 
for EX in (a) and (b). 


Do this by implementing the following three-step procedure a total of 
R = 100 times: 


(i) Run the Metropolis algorithm in (a) so as to generate 
J = 1,000 observations from f(x). 


(ii) Calculate the CI in (a) and count 1 if 1.5 is in it. 
(iii) Calculate the CI in (b) and count 1 if 1.5 is in it. 


Now divide the total count from (ii) by R to get an unbiased point estimate 
of the probability that the ordinary CI for EX in (a) contains EX. 


Similarly, divide the total count from (iii) by R to get an unbiased
point estimate of the probability that the batch means CI for EX in (b)
contains EX.


Also produce 95% CIs for the two probabilities just mentioned.


(d) Repeat the experiment in (c) but with the following in place of (i):


Generate J = 1,000 observations from X's distribution using the
rbeta() function.

Solution to Exercise 6.3
(a) Let us specify a uniform driver centred at the last value and with
half-width $h$. We now iterate as follows after choosing a suitable starting
value of $x$:

Sample $x' \sim U(x - h, x + h)$.


If $x'$ is outside the interval (0,2) then automatically reject $x'$.


Otherwise accept $x'$ with probability $\min(1, p)$, where
$$p = x'^2 / x^2.$$


Starting from x = 1 with h = 0.7, we get an acceptance rate of 55% and
simulated values as depicted in Figures 6.9 and 6.10.


Taking the last 1,000 values of x as a random sample from $f(x)$ we
estimate EX as 1.539, with ordinary 95% CI (1.516, 1.562). We note that
this CI does not contain the true value, 1.5.


Figure 6.9 Trace of sample values

Figure 6.10 Histogram of sample values

(b) Applying the batch means method with m = 20 and n = 50, we estimate
EX as 1.539 again, but with 95% CI (1.467, 1.611). Note that this CI is
wider than the CI in (a) and does contain the true value, 1.5.


(c) After conducting the experiment we estimate $p_1$, the true probability
content of the ordinary 95% CI in (a), as 52.0%, with 95% CI 42.2% to
61.8%.


We also estimate $p_2$, the true probability content of the batch means 95%
CI in (b) (with m = 20 and n = 50), as 90.0%, with 95% CI 84.1% to
95.9%.


We see that in this example the batch means method has performed far
better than the ordinary method for constructing 95% CIs for EX from the
output of a Metropolis algorithm.


(d) Generating each value of X as twice a random number from the
Beta(3,1) distribution, we estimate $p_1$ by 92.0%, with 95% CI 86.7% to
97.3%. We also estimate $p_2$ by 90.0%, with 95% CI 84.1% to 95.9%.


We see that the two CIs have performed about equally well when 
calculated using a truly random sample from X's distribution. In such 
situations, the batch means CI is in fact slightly inferior and the ordinary 
CI should be used. 

R Code for Exercise 6.3 


# (a) 
MET <- function(Jp,x,h){ 
# This function implements a simple Metropolis algorithm. 
# Inputs:  Jp = total number of iterations 
#          x = starting value of x 
#          h = halfwidth of uniform driver. 
# Outputs: $xv = vector of x-values of length (Jp + 1) 
#          $ar = acceptance rate. 
xv <- x; ct <- 0 
for(j in 1:Jp){ xprop <- runif(1,x-h,x+h) 
  if( (xprop>0) && (xprop<2) ){ 
    p <- xprop^2 / x^2; u <- runif(1) 
    if(u < p){ x <- xprop; ct <- ct + 1 } } 
  xv <- c(xv,x) } 
list(xv=xv,ar=ct/Jp) } 

Jp <- 1100; set.seed(151); res <- MET(Jp=Jp,x=1,h=0.7); res$ar # 0.5454545 

X11(w=8,h=4.5); par(mfrow=c(1,1)); 
plot(0:Jp,res$xv,type="l",xlab="j",ylab="x_j") 
xv <- res$xv[-c(1:101)]; J = length(xv) 

hist(xv,xlab="x", prob=T,ylim=c(0,2),nclass=20,ylab="density", main="") 
xvec <- seq(0,2,0.1); fvec <- (3/8)*xvec^2; lines(xvec,fvec) 

EXhat <- mean(xv); sdhat <- sqrt(var(xv)); sdhat # 0.3755086 
EXci <- EXhat + c(-1,1)*qnorm(0.975)*sdhat/sqrt(J) 
c(EXhat,EXci) # 1.538984 1.515710 1.562258 


# (b) 
m <- 20; n <- 50; yv <- rep(NA,m) 
for(k in 1:m){ xvsub <- xv[ ((k-1)*n+1):(k*n) ] 
  yv[k] <- mean(xvsub) } 
sdhat2 <- sqrt(n*var(yv)); sdhat2 # 1.15783 
EXci <- EXhat + c(-1,1)*qnorm(0.975)*sdhat2/sqrt(J) 
c(EXhat,EXci) # 1.538984 1.467222 1.610746 


# (c) 
R <- 100; m <- 20; n <- 50; J <- 1000; burn <- 100; EX <- 1.5; ct1 <- 0; ct2 <- 0; 
yv <- rep(NA,m); set.seed(214) 
for(r in 1:R){ 
  xv <- MET(Jp=burn+J,x=1,h=0.7)$xv[-c(1:101)] 
  # xv <- rbeta(J,3,1)*2   # for use in (d) (see below) 
  for(k in 1:m){ xvsub <- xv[ ((k-1)*n+1):(k*n) ] 
    yv[k] <- mean(xvsub) } 
  EXhat <- mean(xv); sdhat1 <- sqrt(var(xv)); sdhat2 <- sqrt(n*var(yv)) 
  ci1 <- EXhat + c(-1,1)*qnorm(0.975)*sdhat1/sqrt(J) 
  ci2 <- EXhat + c(-1,1)*qnorm(0.975)*sdhat2/sqrt(J) 
  if( (EX >= ci1[1]) && (EX <= ci1[2]) ) ct1 <- ct1 + 1 
  if( (EX >= ci2[1]) && (EX <= ci2[2]) ) ct2 <- ct2 + 1 } 
date() # took 2 secs 

p1 <- ct1/R; p2 <- ct2/R 
p1ci <- p1 + c(-1,1)*qnorm(0.975)*sqrt(p1*(1-p1)/R) 
p2ci <- p2 + c(-1,1)*qnorm(0.975)*sqrt(p2*(1-p2)/R) 
c(p1,p1ci) # 0.5200000 0.4220802 0.6179198 
c(p2,p2ci) # 0.9000000 0.8412011 0.9587989 


# (d) 
# Repeat code in (c) but with the line 
#   "xv <- MET(Jp=burn+J,x=1,h=0.7)$xv[-c(1:101)]" 
# replaced by the line "xv <- rbeta(J,3,1)*2". 
# The results should be: 
# c(p1,p1ci) # 0.9200000 0.8668275 0.9731725 
# c(p2,p2ci) # 0.9000000 0.8412011 0.9587989 


Exercise 6.4 Bayesian inference via the Metropolis algorithm 


The prior on a normal mean μ is uniform from zero to infinity. Values 
are sampled repeatedly from the N(μ,1) distribution until n = 4 positive 
values have been observed, resulting in the data: 0.1, 0.2, 1.9, 0.8. 


Find the posterior mean of μ in the following ways: 
(a) exactly, using numerical integration in R 


(b) approximately, using a Monte Carlo method that does not involve 
Markov chains 


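(c) approximately, using the Metropolis algorithm with a normal driver. 

Solution to Exercise 6.4 


(a) The posterior density of μ is 
$$f(\mu \mid y) \propto f(\mu) f(y \mid \mu) \propto 1 \times \prod_{i=1}^{n} \frac{\frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}(y_i-\mu)^2\right\}}{1-\Phi(-\mu)}, \quad \mu > 0,$$
since 
$$P(y_i > 0 \mid \mu) = 1 - P(Z < -\mu) = 1 - \Phi(-\mu), \quad \text{where } Z \sim N(0,1).$$

Thus 
$$f(\mu \mid y) \propto (1-\Phi(-\mu))^{-n} \exp\left\{-\frac{1}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right\}$$
$$= (1-\Phi(-\mu))^{-n} \exp\left\{-\frac{1}{2}\left[(n-1)s^2 + n(\bar{y}-\mu)^2\right]\right\}$$
$$\propto (1-\Phi(-\mu))^{-n} \exp\left\{-\frac{n}{2}(\mu-\bar{y})^2\right\}$$
$$\equiv k(\mu), \quad \mu > 0$$
(this is the kernel of the posterior density). 

Thus 
$$\hat{\mu} = E(\mu \mid y) = \frac{\int_0^\infty \mu\, k(\mu)\,d\mu}{\int_0^\infty k(\mu)\,d\mu} = \frac{I_1}{I_0},$$
where 
$$I_q = \int_0^\infty \mu^q k(\mu)\,d\mu, \quad q = 0, 1.$$

Using integrate() in R we obtain $I_0$ = 4.328041, $I_1$ = 2.328058 and hence 
$\hat{\mu}$ = 0.5379. 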


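(b) Observe that 
$$\frac{I_1}{I_0} = \frac{\int_0^\infty \mu\,(1-\Phi(-\mu))^{-n} h(\mu)\,d\mu}{\int_0^\infty (1-\Phi(-\mu))^{-n} h(\mu)\,d\mu},$$
where 
$$h(\mu) = \frac{\sqrt{n}\,\phi\left(\sqrt{n}(\mu-\bar{y})\right)}{1-\Phi(-\sqrt{n}\,\bar{y})}, \quad \mu > 0.$$

Note: h(μ) is the density of the N(ȳ,1/n) distribution restricted to 
the positive real line. 

Thus $\hat{\mu} = E_1/E_0$, where 
$$E_q = E\left[\mu^q (1-\Phi(-\mu))^{-n}\right] = \int_0^\infty \mu^q (1-\Phi(-\mu))^{-n} h(\mu)\,d\mu, \quad \mu \sim h(\mu), \text{ i.e. } \mu \sim N(\bar{y},1/n)\ (\mu > 0).$$
Note: At this point we 'forget' about the posterior distribution of μ. 

We see that a non-Markov chain Monte Carlo estimate of $\hat{\mu}$ is 
$$\tilde{\mu} = \frac{\bar{E}_1}{\bar{E}_0},$$
where: 
$$\bar{E}_q = \frac{1}{J}\sum_{j=1}^{J} \mu_j^q (1-\Phi(-\mu_j))^{-n}$$
$$\mu_1, \ldots, \mu_J \sim \text{iid } h(\mu).$$

Note: To obtain the required sample here, we repeatedly sample 
μ ~ N(ȳ,1/n) until J positive values have been achieved. 

Implementing this strategy in R using the rnorm() function with a Monte 
Carlo sample size of J = 100,000, we obtain $\bar{E}_0$ = 3.7059926, 
$\bar{E}_1$ = 1.9900593 and hence $\tilde{\mu}$ = 0.5370. 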


(c) Using the Metropolis algorithm and a normal driver distribution with 
standard deviation 0.5, we obtain a Markov chain of size 10,000 following 
a burn-in of size 100. The acceptance rate is found to be 59%. 


Then taking every 10th value results in a very nearly uncorrelated sample 
of size 1,000 from the posterior distribution of μ. Using these 1,000 
values, we estimate the posterior mean of μ by 0.5297, with associated 
95% CI equal to (0.5047, 0.5547). 


We note that the exact value calculated in (a), 0.5379, is contained in 
this CI. 

R Code for Exercise 6.4 


# (a) 
y=c(0.1, 0.2, 1.9, 0.8); n = length(y); ybar=mean(y); c(n,ybar) # 4.00 0.75 
kfun=function(mu){ exp(-0.5*n*(mu-ybar)^2) / (1-pnorm(-mu))^n } 
topfun=function(mu){ mu * kfun(mu) } 

par(mfrow=c(2,1)); muvec=seq(0,5,0.1) 
plot(muvec,kfun(muvec),type="l"); abline(h=0,lty=3) # OK 
plot(muvec,topfun(muvec),type="l"); abline(h=0,lty=3) # OK 
top=integrate(f=topfun,lower=0,upper=5)$value 
bot=integrate(f=kfun,lower=0,upper=5)$value 
c(bot,top,top/bot) # 4.328041 2.328058 0.537901 


# (b) 
J=110000; set.seed(551); samp=rnorm(J,ybar,1/sqrt(n)) 
samppos=samp[samp>0]; length(samppos) # 102763 
samppos=samppos[1:100000] 
numer=mean( samppos*(1-pnorm(-samppos))^(-n) ) 
denom=mean( (1-pnorm(-samppos))^(-n) ) 
c(numer,denom,numer/denom) # 1.9900593 3.7059926 0.5369842 


# (c) 
MET <- function(K,mu,del,y){ 
# This function implements a simple Metropolis algorithm. 
# Inputs:  K = total number of iterations 
#          mu = starting value of mu 
#          del = standard deviation of normal driver 
#          y = data vector 
# Outputs: $muv = vector of mu-values of length (K + 1) 
#          $ar = acceptance rate 
muv = mu; ct = 0; n=length(y); ybar=mean(y) 
kfun=function(mu,ybar,n){ exp(-0.5*n*(mu-ybar)^2) / (1-pnorm(-mu))^n } 
for(j in 1:K){ muprop = rnorm(1,mu,del) 
  if( muprop>0 ){ 
    p=kfun(mu=muprop,ybar=ybar,n=n)/kfun(mu=mu,ybar=ybar,n=n) 
    u=runif(1); if(u < p){ mu = muprop; ct = ct+1 } } 
  muv = c(muv,mu) } 
list(muv=muv,ar=ct/K) } 

K=10100; set.seed(352); res = MET(K=K,mu=1,del=0.5,y=y) 
res$ar # 0.590297 

mean(res$muv) # 0.5303868 = preliminary estimate 

plot(0:K,res$muv,type="l") 
vec1=res$muv[-(1:101)] 
print(acf(vec1)$acf[1:10],digits=2) # Evidence of strong autocorrelation 
# 1.00 0.78 0.61 0.48 0.39 0.30 0.24 0.19 0.14 0.11 

v=vec1[seq(10,10000,10)] # Take every 10th value only 
print(acf(v)$acf[1:10],digits=2) # No apparent residual autocorrelation 
# 1.0000 0.0534 0.0014 0.0331 -0.0089 -0.0041 0.0034 0.0087 0.0102 0.0133 

J=length(v); J # 1000 
est=mean(v); std=sd(v); ci=est+c(-1,1)*qnorm(0.975)*std/sqrt(J) 
c(est,std,ci) # 0.5296887 0.4039238 0.5046537 0.5547237 


6.4 Computational issues 


Numerical issues may arise when attempting to calculate the acceptance 
probability 


$$p = f(x'_j) / f(x_{j-1})$$

due to $f(x'_j)$ or $f(x_{j-1})$ being too large or too small for R to handle. 


One relevant fact here is that in R on most computers (at present), 5e-324 
(meaning $5 \times 10^{-324}$) is the smallest representable non-zero number. This 
problem can often be resolved by calculating p as 
$$p = \exp(q)$$
after first computing 
$$q = \log f(x'_j) - \log f(x_{j-1}),$$
but even this formulation may not be sufficient in every situation. 
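
The following minimal sketch illustrates why the log-scale calculation can help. The target kernel `logf` and the two x-values are hypothetical, chosen only so that both density values underflow to zero in double precision; they are not taken from any exercise in this chapter.

```r
# Hypothetical unnormalised log-density (an N(0,1) kernel), for illustration only
logf <- function(x) -0.5 * x^2
f    <- function(x) exp(logf(x))

x_last <- 40      # last value in the chain (assumed)
x_prop <- 39.9    # proposed value (assumed)

# Direct calculation: both f-values underflow to 0, so the ratio is 0/0 = NaN
p_naive <- f(x_prop) / f(x_last)

# Log-scale calculation: q is a moderate number, so exp(q) is well defined
q     <- logf(x_prop) - logf(x_last)
p_log <- exp(q)

# The accept/reject decision can also be made entirely on the log scale,
# since u < p is equivalent to log(u) < q
set.seed(1)
u      <- runif(1)
accept <- log(u) < q

print(c(p_naive = p_naive, q = q, p_log = p_log))
```

Comparing log(u) with q avoids forming exp(q) at all, which may also guard against overflow when q is very large.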


It may sometimes also be necessary to replace the calculation of a function, 
say h(r), by 

h(max(r, 5e-324)) 
if that function requires a non-zero argument r which is likely to be 
reported by R as 0 (because the exact value of r is likely to be between 0 
and 5e-324). 


Further, and by the same token, if 

0 < h(max(r, 5e-324)) < 5e-324 
then R will report a value of 0. In that case, if a non-zero value of 
h is absolutely required (for some subsequent calculation) then the 
code for h(r) should be replaced by code which returns 


max(h(max(r, 5e-324)), 5e-324). 

6.5 Non-symmetric drivers and the general 
Metropolis algorithm 


In some cases, applying the Metropolis algorithm as described above may 
lead to poor mixing, even after experimentation to decide on the most 
suitable value of the tuning constant. 


For example, if the random variable of interest is strictly positive with a 
pdf f(x) which is positively skewed and highly concentrated just above 
0 (for example, if f(x) → ∞ as x ↓ 0), proposing a value symmetrically 
distributed around the last value may lead to many candidate values which 
are negative and therefore automatically rejected. 


In such cases, the support of X may not be properly represented, and it 
may be preferable to choose a different type of driver distribution, one 
which adapts ‘cleverly’ to the current state of the Markov chain. 


This can be achieved using the general Metropolis algorithm which 
allows for non-symmetric driver distributions. As before, let g(t |x) 
denote a driver density, where t denotes the proposed value and x is the 
last value in the chain. Then at iteration j, after generating a proposed 
value from the driver distribution, 

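$$x'_j \sim g(t \mid x = x_{j-1}),$$
the acceptance probability is 
$$p = \frac{f(x'_j)}{f(x_{j-1})} \times \frac{g(x_{j-1} \mid x'_j)}{g(x'_j \mid x_{j-1})}.$$


Note 1: Previously, when g(t|x) was assumed to be symmetric, 
$$\frac{g(x_{j-1} \mid x'_j)}{g(x'_j \mid x_{j-1})} = 1.$$


Note 2: To calculate p, the best strategy is to let 
$$p = \exp(q)$$
after first computing 
$$q = \log f(x'_j) - \log f(x_{j-1}) + \log g(x_{j-1} \mid x'_j) - \log g(x'_j \mid x_{j-1}).$$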
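
The general algorithm just described can be sketched as a short R function. This is only an illustration under stated assumptions: the function name `gen_metropolis` and its arguments are hypothetical, the target is taken to live on the positive real line, and the example target (a Gamma(3,1) kernel) and gamma driver with tuning constant `del` were chosen purely for demonstration.

```r
# General Metropolis algorithm for a target on (0, Inf) with a non-symmetric driver.
# logf(x)    : log of the target density, up to an additive constant
# rdrv(x)    : draws one proposed value from g(t | x)
# logg(t, x) : log of the driver density g(t | x)
gen_metropolis <- function(K, x0, logf, rdrv, logg) {
  xv <- numeric(K + 1); xv[1] <- x0
  x <- x0; ct <- 0
  for (j in 1:K) {
    xp <- rdrv(x)
    if (xp > 0) {  # a proposal that underflows to exactly 0 is rejected
      # q = log f(x') - log f(x) + log g(x | x') - log g(x' | x)
      q <- logf(xp) - logf(x) + logg(x, xp) - logg(xp, x)
      if (log(runif(1)) < q) { x <- xp; ct <- ct + 1 }
    }
    xv[j + 1] <- x
  }
  list(xv = xv, ar = ct / K)
}

# Hypothetical example: Gamma(3,1) target kernel and driver (t | x) ~ G(x*del, del)
del  <- 2
logf <- function(x) 2 * log(x) - x
rdrv <- function(x) rgamma(1, shape = x * del, rate = del)
logg <- function(t, x) dgamma(t, shape = x * del, rate = del, log = TRUE)

set.seed(1)
res <- gen_metropolis(K = 20100, x0 = 1, logf = logf, rdrv = rdrv, logg = logg)
xs  <- res$xv[-(1:101)]          # discard starting value and burn-in of 100
c(ar = res$ar, mean = mean(xs))  # the mean of Gamma(3,1) is 3
```

Rejecting a proposal of exactly zero is harmless for this particular target because its density vanishes at zero; a target that is unbounded near zero would need more careful treatment.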
Exercise 6.5 A Metropolis algorithm with a non-symmetric 
driver 


Generate a random sample of size 10,000 from the distribution defined by 
the pdf 
$$f(x) = \begin{cases} \dfrac{1}{4\sqrt{x}}, & 0 < x < 1 \\[6pt] \dfrac{1}{2} e^{-(x-1)}, & x > 1 \end{cases}$$
using the Metropolis algorithm and a non-symmetric driver with density 
of the form 
$$g(t \mid x) = f_{G(x\delta,\delta)}(t) = \frac{\delta^{x\delta}\, t^{x\delta-1} e^{-\delta t}}{\Gamma(x\delta)}, \quad t > 0,$$
or equivalently, a driver defined by 
$$(t \mid x) \sim G(x\delta, \delta).$$


Check your results by plotting a probability histogram of the sample 
values and overlaying the target density, f(x). Also discuss why this 
driver is suitable in this situation. 


Solution to Exercise 6.5 


At each iteration j the proposed value is generated by sampling 
$$x'_j \sim G(x_{j-1}\delta, \delta).$$


The rationale for this choice of driver is that the proposed value is 
certainly positive, and it has: 
mean $x_{j-1}\delta/\delta = x_{j-1}$ 
variance $x_{j-1}\delta/\delta^2 = x_{j-1}/\delta$. 


Thus the candidate $x'_j$ is guaranteed to be in the appropriate range (x > 0), 
and it is centred at the last value ($x_{j-1}$). 


Also, its variance around that last value is proportional to it (by a factor 
of 1/δ). This ensures that values near zero are appropriately 'explored' 
by the Markov chain. 

With this driver, the acceptance probability at iteration j is 
$$p = \exp(q),$$
where: 
$$q = \log f(x'_j) - \log f(x_{j-1}) + \log g(x_{j-1} \mid x'_j) - \log g(x'_j \mid x_{j-1})$$
$$\log f(x) = I(0 < x < 1)\{-0.5\log x - \log 4\} + I(x > 1)\{1 - x - \log 2\}$$
$$\log g(t \mid x) = x\delta\log\delta + (x\delta - 1)\log t - t\delta - \log\Gamma(x\delta).$$
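
The stated mean and variance of the gamma driver can be checked numerically. The following short sketch uses a hypothetical last value `x_last = 2` together with the tuning constant 1.3 and simply compares sample moments of simulated proposals with the theoretical values.

```r
set.seed(1)
x_last <- 2      # hypothetical last value of the chain
del    <- 1.3    # tuning constant (delta)

# 100,000 proposals from the driver G(x_last * del, del)
prop <- rgamma(100000, shape = x_last * del, rate = del)

# Sample mean and variance, next to the theoretical values x and x / delta
c(mean = mean(prop), var = var(prop),
  theory_mean = x_last, theory_var = x_last / del)
```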


Even with this use of the logarithmic function, computational issues arose 
in R on account of limitations with the functions rgamma() and lgamma(). 
These limitations are acknowledged in the help files for these functions 
in R. 


To give an example: 
set.seed(321) 
v = rgamma(10000,0.001,0.001) 
# Large sample from the G(0.001,0.001) distribution. 

mean(v) # 0.5827886 

# This is clearly wrong since the mean is 0.001/0.001 = 1. 
length(v[v==0]) # 4777 

# Almost HALF of the values are EXACTLY zero. 


The R code was appropriately modified so that whenever very small but 
non-zero values were reported as zero by R (and problems ensued or 
potentially ensued because of this) those values were changed in the code 
to 5e-324 (the smallest representable non-zero number in R). 


With the above specification and fixes, the Metropolis algorithm was run 
for 10,000 iterations following a burn-in of size 100 and starting at 1. The 
value of δ used was 1.3 and this resulted in an acceptance rate of 53% as 
well as good mixing. Figure 6.11 shows the resulting trace of all 10,101 
values of x, and Figure 6.12 shows the required probability histogram of 
the last 10,000 values, together with the exact density f(x) overlaid. 


Note: Applying a gamma driver here (in an attempt to improve the 
‘vanilla’ version of the Metropolis algorithm) created problems, due to 
numerical issues in R associated with the gamma distribution. With 
some modifications, we were in the end able to make things work. 
Another choice of nonsymmetric driver distribution is the lognormal, 
and we leave it as an additional exercise to examine this option in detail. 
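
As a starting point for that additional exercise, here is one possible sketch of a Metropolis algorithm with a lognormal driver, (t | x) ~ LN(log x, s^2). It is not the author's solution: the function name `MET_LN`, the tuning constant `s = 1.5` and the seed are assumptions made for illustration, and no tuning has been attempted. The check at the end uses the fact that the target density integrates to exactly 1/2 over (0,1).

```r
# Log of the target density from Exercise 6.5 (vectorised over x > 0)
logf <- function(x) ifelse(x > 1, 1 - x - log(2), -0.5 * log(x) - log(4))

# Metropolis algorithm with lognormal driver (t | x) ~ LN(log(x), s^2)
MET_LN <- function(K, x, s) {
  xv <- numeric(K + 1); xv[1] <- x; ct <- 0
  for (j in 1:K) {
    xp <- rlnorm(1, meanlog = log(x), sdlog = s)
    if (xp > 0 && is.finite(xp)) {  # guard against underflow or overflow
      # q = log f(x') - log f(x) + log g(x | x') - log g(x' | x)
      q <- logf(xp) - logf(x) +
        dlnorm(x,  meanlog = log(xp), sdlog = s, log = TRUE) -
        dlnorm(xp, meanlog = log(x),  sdlog = s, log = TRUE)
      if (log(runif(1)) < q) { x <- xp; ct <- ct + 1 }
    }
    xv[j + 1] <- x
  }
  list(xv = xv, ar = ct / K)
}

set.seed(1)
res_ln <- MET_LN(K = 20100, x = 1, s = 1.5)
xs <- res_ln$xv[-(1:101)]   # discard starting value and burn-in of 100
c(ar = res_ln$ar, prop_below_1 = mean(xs < 1))   # exact P(X < 1) is 0.5
```

Because log(t) follows a normal random walk under this driver, proposals are automatically positive, and the driver's density ratio reduces to x'/x.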
Figure 6.11 Trace of simulated values 

Figure 6.12 Histogram and true density 

R Code for Exercise 6.5 


set.seed(321); v = rgamma(10000,0.001,0.001) 
# Large sample from the G(0.001,0.001) distribution. 
mean(v) # 0.5827886  This is clearly wrong since the mean is 0.001/0.001 = 1. 
length(v[v==0]) # 4777  Almost HALF of the values are EXACTLY zero. 

logffun=function(x){ res=-0.5*log(x)-log(4); if(x>1) res=1-x-log(2); res } 
loggfun=function(t,x,del){ 
  x*del*log(del)+(x*del-1)*log(t)-t*del-lgamma(max( x*del, 5e-324 )) } 

MET <- function(K,x,del){ # This function implements a simple Metropolis alg. 
# Inputs:  K = total number of iterations, x = starting value of x, 
#          del = tuning constant in driver 
# Outputs: $xv = vector of x-values of length (K + 1), $ar = acceptance rate 
xv=x; ct=0 
for(j in 1:K){ xp = max( rgamma(1,x*del,del), 5e-324 ) 
  logp = logffun(x=xp) - logffun(x=x) + 
         loggfun(t=x,x=xp,del=del) - loggfun(t=xp,x=x,del=del) 
  p=exp(logp); u=runif(1); if(u<p){ x=xp; ct=ct+1 } 
  xv = c(xv,x) } 
list(xv=xv,ar=ct/K) } 


X11(w=8,h=4.5); par(mfrow=c(1,1)); K = 10100; 
set.seed(319); res = MET(K=K,x=1,del=1.3); res$ar # 0.5324752 
plot(0:K,res$xv,type="l",xlab="j",ylab="x_j") 
xv <- res$xv[-c(1:101)] 
hist(xv,xlab="x", prob=T,ylim=c(0,2.5),xlim=c(0,5), ylab="density", main="", 
  breaks=seq(0,20,0.05) ) 
xvec=seq(0,10,0.001); fvec=xvec; 
for(i in 1:length(xvec)) fvec[i]=exp(logffun(xvec[i])) 
lines(xvec,fvec,lwd=2) 

summary(res$xv) 
#     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
# 0.004243 0.309400 1.034000 1.218000 1.738000 9.356000  (OK, as Min > 0) 


6.6 The Metropolis-Hastings algorithm 


We have introduced Markov chain Monte Carlo methods with a detailed 
discussion of the Metropolis algorithm. As already noted, this algorithm 
is limited and rarely used on its own because it can only be used to sample 
from univariate distributions. Typically, other methods will be better 
suited to the task of sampling from a univariate distribution. 


We now turn to the Metropolis-Hastings (MH) algorithm, a generalisation 
of the Metropolis algorithm that can be used to sample from a very wide 
range of multivariate distributions. This algorithm is very useful and has 
been applied in many difficult statistical modelling settings. 

First let us again review the Metropolis algorithm for sampling from a 
univariate density, f(x). This involves choosing an arbitrary starting 
value of x, a suitable driver density g(t|x) and then repeatedly proposing 
a value x' ~ g(t|x), each time accepting this value with probability 
$$p = \frac{f(x')}{f(x)} \times \frac{g(x \mid x')}{g(x' \mid x)}$$
(or $p = f(x')/f(x)$ in the case of a symmetric driver). 


Each proposal and then either acceptance or rejection constitutes one 
iteration of the algorithm and may be referred to as a Metropolis step. 


Performing K iterations, each consisting of a single Metropolis step, 
results in a Markov chain of values which may be denoted 
$x^{(1)}, x^{(2)}, \ldots, x^{(K)}$. 


Assuming that stochastic equilibrium has been attained within B iterations 
(B standing for burn-in) the last J = K - B values may be renumbered so 
as to yield the required sample, $x^{(1)}, \ldots, x^{(J)} \sim$ iid f(x). 


The Metropolis-Hastings (MH) algorithm is a generalisation of this 
procedure to the case where x is a vector of length M (say) . 


The bivariate MH algorithm 


For simplicity we will first focus on the bivariate case (M = 2). Thus, 
suppose we wish to generate a random sample from the distribution of a 
random vector $X = (X_1, X_2)$ with pdf f(x), where $x = (x_1, x_2)$ denotes 
a value of X. 


First, choose an initial value of $x = (x_1, x_2)$. 


Then choose two suitable driver distributions or densities: 
$$g_1(t \mid x_1, x_2)$$
$$g_2(t \mid x_1, x_2).$$


Next perform the following two Metropolis steps: 

1. Propose a candidate value of $x_1$ by sampling 
$$x_1' \sim g_1(t \mid x_1, x_2),$$
and accept this value with probability 
$$p_1 = \frac{f(x_1' \mid x_2)}{f(x_1 \mid x_2)} \times \frac{g_1(x_1 \mid x_1', x_2)}{g_1(x_1' \mid x_1, x_2)}.$$
(In the case of an acceptance, let $x_1 = x_1'$, 
and otherwise leave $x_1$ unchanged.) 


2. Propose a candidate value of $x_2$ by sampling 
$$x_2' \sim g_2(t \mid x_1, x_2),$$
and accept this value with probability 
$$p_2 = \frac{f(x_2' \mid x_1)}{f(x_2 \mid x_1)} \times \frac{g_2(x_2 \mid x_1, x_2')}{g_2(x_2' \mid x_1, x_2)}.$$
(In the case of an acceptance, let $x_2 = x_2'$, 
and otherwise leave $x_2$ unchanged.) 

This completes the first iteration of the MH algorithm. 


The initial value of $x = (x_1, x_2)$ may be denoted 
$$x^{(0)} = (x_1^{(0)}, x_2^{(0)}),$$
and the current value of the Markov chain may be denoted 
$$x^{(1)} = (x_1^{(1)}, x_2^{(1)}).$$
Performing another iteration of the MH algorithm as above (starting from 
$x = x^{(1)}$) leads to the next value, 
$$x^{(2)} = (x_1^{(2)}, x_2^{(2)}),$$
and so on. 


Continuing in this fashion results in a Markov chain of vectors, 
$$x^{(0)}, x^{(1)}, \ldots, x^{(K)}.$$


Assuming that stochastic equilibrium has been attained within B iterations, 
the last J = K - B vectors may be renumbered consecutively to yield the 
required sample, 
$$x^{(1)}, \ldots, x^{(J)} \sim \text{iid } f(x),$$
where $x^{(j)} = (x_1^{(j)}, x_2^{(j)})$. 




Note 1: This multivariate sample can then be used to perform marginal 
inferences. For example, by discarding all the $x_2^{(j)}$ values, we obtain a 
sample from the marginal distribution of $x_1$, namely 
$$x_1^{(1)}, \ldots, x_1^{(J)} \sim \text{iid } f(x_1).$$


This technique would be useful if obtaining a sample from $f(x_1)$ 
directly were for any reason problematic. For example, the marginal 
density 
$$f(x_1) = \int f(x_1, x_2)\,dx_2$$
might be difficult to derive explicitly or sample from. 


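Note 2: Observe that 
$$\frac{f(x_1' \mid x_2)}{f(x_1 \mid x_2)} = \frac{f(x_1', x_2)/f(x_2)}{f(x_1, x_2)/f(x_2)} = \frac{f(x_1', x_2)}{f(x_1, x_2)}, \text{ etc.}$$


Thus the two acceptance probabilities could also be written as: 
$$p_1 = \frac{f(x_1', x_2)}{f(x_1, x_2)} \times \frac{g_1(x_1 \mid x_1', x_2)}{g_1(x_1' \mid x_1, x_2)}$$
$$p_2 = \frac{f(x_1, x_2')}{f(x_1, x_2)} \times \frac{g_2(x_2 \mid x_1, x_2')}{g_2(x_2' \mid x_1, x_2)}.$$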
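
The two Metropolis steps of the bivariate MH algorithm can be sketched in R as follows. Everything specific here is an assumption made for illustration: the target is a standard bivariate normal with a hypothetical correlation `rho = 0.8`, both drivers are symmetric uniforms with half-widths `h1` and `h2`, and the function name `MH2` is not from the text. Because the drivers are symmetric, the driver ratios equal 1 and only the ratio of joint densities is needed.

```r
rho  <- 0.8   # hypothetical correlation of the illustrative target
# Log of the joint target kernel f(x1, x2)
logf <- function(x1, x2) -(x1^2 - 2 * rho * x1 * x2 + x2^2) / (2 * (1 - rho^2))

MH2 <- function(K, x1, x2, h1, h2) {
  x1v <- numeric(K + 1); x2v <- numeric(K + 1)
  x1v[1] <- x1; x2v[1] <- x2; ct1 <- 0; ct2 <- 0
  for (j in 1:K) {
    # Step 1: propose a new x1 and accept or reject it, holding x2 fixed
    x1p <- runif(1, x1 - h1, x1 + h1)
    if (log(runif(1)) < logf(x1p, x2) - logf(x1, x2)) { x1 <- x1p; ct1 <- ct1 + 1 }
    # Step 2: the same for x2, using the current (possibly updated) x1
    x2p <- runif(1, x2 - h2, x2 + h2)
    if (log(runif(1)) < logf(x1, x2p) - logf(x1, x2)) { x2 <- x2p; ct2 <- ct2 + 1 }
    x1v[j + 1] <- x1; x2v[j + 1] <- x2
  }
  list(x1v = x1v, x2v = x2v, ar1 = ct1 / K, ar2 = ct2 / K)
}

set.seed(1)
res2 <- MH2(K = 20100, x1 = 0, x2 = 0, h1 = 1.5, h2 = 1.5)
x1s <- res2$x1v[-(1:101)]; x2s <- res2$x2v[-(1:101)]   # drop start and burn-in

# Marginal inference on x1 simply ignores the x2 values
c(mean_x1 = mean(x1s), sd_x1 = sd(x1s), cor = cor(x1s, x2s))
```

For this target the marginal distribution of x1 is N(0,1), so the first two summaries should be close to 0 and 1, and the sample correlation should be close to the assumed value of rho.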


The trivariate MH algorithm 


The Metropolis-Hastings algorithm for sampling from the trivariate 
distribution (M = 3) of a vector random variable $X = (X_1, X_2, X_3)$ 
involves choosing an initial value of the vector 
$$x = (x_1, x_2, x_3),$$
specifying three driver densities: 
$$g_1(t \mid x_1, x_2, x_3)$$
$$g_2(t \mid x_1, x_2, x_3)$$
$$g_3(t \mid x_1, x_2, x_3),$$
and repeatedly iterating three Metropolis steps as follows: 

1. Propose a candidate value of $x_1$ by sampling 
$$x_1' \sim g_1(t \mid x_1, x_2, x_3),$$
and accept this value with probability 
$$p_1 = \frac{f(x_1' \mid x_2, x_3)}{f(x_1 \mid x_2, x_3)} \times \frac{g_1(x_1 \mid x_1', x_2, x_3)}{g_1(x_1' \mid x_1, x_2, x_3)}.$$


2. Propose a candidate value of $x_2$ by sampling 
$$x_2' \sim g_2(t \mid x_1, x_2, x_3),$$
and accept this value with probability 
$$p_2 = \frac{f(x_2' \mid x_1, x_3)}{f(x_2 \mid x_1, x_3)} \times \frac{g_2(x_2 \mid x_1, x_2', x_3)}{g_2(x_2' \mid x_1, x_2, x_3)}.$$


3. Propose a candidate value of $x_3$ by sampling 
$$x_3' \sim g_3(t \mid x_1, x_2, x_3),$$
and accept this value with probability 
$$p_3 = \frac{f(x_3' \mid x_1, x_2)}{f(x_3 \mid x_1, x_2)} \times \frac{g_3(x_3 \mid x_1, x_2, x_3')}{g_3(x_3' \mid x_1, x_2, x_3)}.$$


As before, continuing in this fashion until stochastic equilibrium has been 
achieved, and then for another J iterations, leads to the random sample 
$x^{(1)}, \ldots, x^{(J)} \sim$ iid f(x), where now $x^{(j)} = (x_1^{(j)}, x_2^{(j)}, x_3^{(j)})$. 

Note: As before, the $x_1^{(j)}$ values on their own then constitute a sample 
from the marginal distribution of $x_1$, whose density is now 
$$f(x_1) = \int\!\!\int f(x_1, x_2, x_3)\,dx_2\,dx_3,$$
and the three acceptance probabilities can also be expressed as 
$$p_1 = \frac{f(x_1', x_2, x_3)}{f(x_1, x_2, x_3)} \times \frac{g_1(x_1 \mid x_1', x_2, x_3)}{g_1(x_1' \mid x_1, x_2, x_3)}, \text{ etc.}$$

The general MH algorithm 


These ideas extend naturally and in an obvious fashion to higher values 
of M. Thus, for sampling from an M-variate distribution with density 
$f(x_1, \ldots, x_M)$, the MH algorithm involves choosing a starting value 
$$x = (x_1, \ldots, x_M),$$
specifying M drivers, 
$$g_m(t \mid x_1, \ldots, x_M) \quad (m = 1, \ldots, M),$$
and repeatedly iterating M steps as follows: 


1. Propose a candidate value of $x_1$ by sampling 
$$x_1' \sim g_1(t \mid x_1, \ldots, x_M),$$
and accept this value with probability 
$$p_1 = \frac{f(x_1' \mid x_2, \ldots, x_M)}{f(x_1 \mid x_2, \ldots, x_M)} \times \frac{g_1(x_1 \mid x_1', x_2, \ldots, x_M)}{g_1(x_1' \mid x_1, x_2, \ldots, x_M)}.$$


2. Propose a candidate value of $x_2$ by sampling 
$$x_2' \sim g_2(t \mid x_1, \ldots, x_M),$$
and accept this value with probability 
$$p_2 = \frac{f(x_2' \mid x_1, x_3, \ldots, x_M)}{f(x_2 \mid x_1, x_3, \ldots, x_M)} \times \frac{g_2(x_2 \mid x_1, x_2', x_3, \ldots, x_M)}{g_2(x_2' \mid x_1, x_2, x_3, \ldots, x_M)}.$$

... 

M. Propose a candidate value of $x_M$ by sampling 
$$x_M' \sim g_M(t \mid x_1, \ldots, x_M),$$
and accept this value with probability 
$$p_M = \frac{f(x_M' \mid x_1, \ldots, x_{M-1})}{f(x_M \mid x_1, \ldots, x_{M-1})} \times \frac{g_M(x_M \mid x_1, \ldots, x_{M-1}, x_M')}{g_M(x_M' \mid x_1, \ldots, x_{M-1}, x_M)}.$$


As before, continuing in this fashion until stochastic equilibrium and then 
for J more iterations leads to the sample $x^{(1)}, \ldots, x^{(J)} \sim$ iid f(x), where 
now $x^{(j)} = (x_1^{(j)}, \ldots, x_M^{(j)})$. 

Note: Again, the $x_1^{(j)}$ values on their own then constitute a sample from 
the marginal distribution of $x_1$, whose density is now 
$$f(x_1) = \int \cdots \int f(x_1, \ldots, x_M)\,dx_2 \cdots dx_M,$$
and the M acceptance probabilities can also be expressed as 
$$p_1 = \frac{f(x_1', x_2, \ldots, x_M)}{f(x_1, x_2, \ldots, x_M)} \times \frac{g_1(x_1 \mid x_1', x_2, \ldots, x_M)}{g_1(x_1' \mid x_1, x_2, \ldots, x_M)}, \text{ etc.}$$

Exercise 6.6 MH algorithm applied to a bent coin which is 
tossed an unknown number of times 


Suppose that five heads have come up on an unknown number of tosses 
of a bent coin. 


Before the experiment, we believed the coin was going to be tossed a 
number of times equal to 1, 2, 3, ..., or 9, with all possibilities equally 
likely. As regards the probability of heads coming up on a single toss, we 
deemed no value more or less likely than any other value. We also 
considered the probability of heads as unrelated to the number of tosses. 


Find the marginal posterior distribution and mean of the number of tosses 
and of the probability of heads, respectively. Also find the number of 
heads we can expect to come up if the coin is tossed again the same 
number of times. 


Do all this via Monte Carlo by designing and implementing a suitable MH 
algorithm. 


Note: This problem was solved analytically in Exercise 3.10. 


Solution to Exercise 6.6 


As in Exercise 3.10, the relevant Bayesian model is: 
$$(y \mid \theta, n) \sim \text{Binomial}(n, \theta)$$
$$(\theta \mid n) \sim U(0,1)$$
$$f(n) = 1/k, \quad n = 1, \ldots, k, \quad k = 9,$$
and the joint posterior density of the two parameters n and θ is 
$$f(n, \theta \mid y) \propto f(n) f(\theta \mid n) f(y \mid n, \theta) \propto \frac{n!\,\theta^y (1-\theta)^{n-y}}{(n-y)!} \equiv h(n, \theta), \quad 0 < \theta < 1, \; n = y, y+1, \ldots, k.$$


Let us now specify the driver for n as discrete uniform over the integers 
from n - r to n + r, where r is a tuning parameter. 


Also let the driver for θ be uniform from θ - c to θ + c, where c is 
another tuning parameter. 

Note: These drivers may also be expressed by writing the distributions 
explicitly as: 
$$n' \sim DU(n-r, n-r+1, \ldots, n+r)$$
$$\theta' \sim U(\theta - c, \theta + c),$$
or by writing the driver densities explicitly as: 
$$g_1(t \mid n, \theta) = \frac{1}{2r+1}, \quad t = n-r, n-r+1, \ldots, n+r$$
$$g_2(t \mid n, \theta) = \frac{1}{2c}, \quad \theta - c < t < \theta + c.$$


Noting that both drivers are symmetric, a suitable MH algorithm may be 
defined by the following two steps at each iteration: 


1. Propose a value 
$$n' \sim DU(n-r, \ldots, n+r),$$
and accept this value with probability 
$$p_1 = \frac{f(n', \theta \mid y)}{f(n, \theta \mid y)} = \frac{h(n', \theta)}{h(n, \theta)} = \frac{n'!\,\theta^y (1-\theta)^{n'-y}/(n'-y)!}{n!\,\theta^y (1-\theta)^{n-y}/(n-y)!} = \frac{n'!\,(1-\theta)^{n'}/(n'-y)!}{n!\,(1-\theta)^{n}/(n-y)!}.$$


2. Propose a value 
$$\theta' \sim U(\theta - c, \theta + c),$$
and accept this value with probability 
$$p_2 = \frac{f(n, \theta' \mid y)}{f(n, \theta \mid y)} = \frac{h(n, \theta')}{h(n, \theta)} = \frac{n!\,\theta'^y (1-\theta')^{n-y}/(n-y)!}{n!\,\theta^y (1-\theta)^{n-y}/(n-y)!} = \frac{\theta'^y (1-\theta')^{n-y}}{\theta^y (1-\theta)^{n-y}}.$$


Note: The proposed value n' should automatically be rejected if it is 
outside the set {5,...,9} (because then f(n',θ|y) = 0), and otherwise 
automatically accepted if $p_1 > 1$. If n' = n then $p_1 = 1$, again leading to 
automatic acceptance. 


Likewise, the proposed value θ' should be automatically rejected if it 
is outside the interval (0,1), and otherwise automatically accepted if 
$p_2 > 1$. 

Setting c = 0.3 and r = 1 (after some experimentation) and starting from 
n = 7 and θ = 0.5, the MH algorithm converged very quickly, with 
acceptance rates of 73% for n and 58% for θ over a total of 10,100 
iterations. 


The first 100 iterations were thrown away as the burn-in, and then every 
20th value (only) was recorded so as to yield an approximately 
random sample of size J = 500 from the joint posterior distribution of n 
and θ, namely $(n_1, \theta_1), \ldots, (n_J, \theta_J) \sim$ iid f(n,θ|y). 


Figures 6.13 and 6.14 show the traces for all 10,101 
values of n and θ, respectively, and Figures 6.15 and 6.16 
show the traces for the final 500 values of n and θ, respectively. 


Figure 6.17 shows the corresponding sample ACFs 
(autocorrelation functions), labelled nvO and thvO for the last 10,000 
values of n and θ, respectively, and labelled nv and thv for the final 500 
values of n and θ. The thinning process has dramatically reduced the high 
serial correlation. 


The final bivariate sample of size J = 500 was used for Monte Carlo 
inference in the usual way, with the following results. 


The MC estimate of $\hat{n} = E(n \mid y)$ (= 6.744) was $\bar{n}$ = 6.708, with 95% CI 
(6.587, 6.829). 


The Monte Carlo estimate of $\hat{\theta} = E(\theta \mid y)$ (= 0.7040) was $\bar{\theta}$ = 0.7097, 
with 95% CI (0.6943, 0.7252). Also, the 95% CPDR estimate for θ was 
(0.3547, 0.9886). 
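
The exercise also asks for the number of heads to be expected if the coin is tossed again the same number of times. If x denotes that future number of heads, then E(x | n, θ) = nθ, so E(x | y) = E(nθ | y) can be estimated by averaging the products of the sampled values of n and θ. The sketch below is a compact, self-contained variant of the MH algorithm for this model rather than the listing given for this exercise: the names `logh`, `MH66` and `cw` are hypothetical, the seed and run length are arbitrary, and no thinning is applied. The exact value, 4.592, is the quantity xhat computed analytically in the R code for this exercise.

```r
# Log of the joint posterior kernel h(n, theta)
logh <- function(n, th, y) {
  lgamma(n + 1) + y * log(th) + (n - y) * log(1 - th) - lgamma(n - y + 1)
}

# Compact MH sampler; r and cw are the tuning parameters for n and theta
MH66 <- function(Jp, n, th, cw, r, y, k) {
  nv <- numeric(Jp + 1); thv <- numeric(Jp + 1); nv[1] <- n; thv[1] <- th
  for (j in 1:Jp) {
    np <- n + sample(-r:r, 1)          # discrete uniform driver for n
    if (np >= y && np <= k &&
        log(runif(1)) < logh(np, th, y) - logh(n, th, y)) n <- np
    thp <- runif(1, th - cw, th + cw)  # uniform driver for theta
    if (thp > 0 && thp < 1 &&
        log(runif(1)) < logh(n, thp, y) - logh(n, th, y)) th <- thp
    nv[j + 1] <- n; thv[j + 1] <- th
  }
  list(nv = nv, thv = thv)
}

set.seed(1)
out  <- MH66(Jp = 20100, n = 7, th = 0.5, cw = 0.3, r = 1, y = 5, k = 9)
n_s  <- out$nv[-(1:101)]; th_s <- out$thv[-(1:101)]   # drop start and burn-in

xhat_mc <- mean(n_s * th_s)   # MC estimate of E(x | y) = E(n * theta | y)
c(nbar = mean(n_s), thbar = mean(th_s), xhat_mc = xhat_mc)
```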


Figure 6.18 is a probability histogram of the almost random 
sample $n_1, \ldots, n_J \sim$ iid f(n|y), and Figure 6.19 is a probability 
histogram of the almost random sample $\theta_1, \ldots, \theta_J \sim$ iid f(θ|y). 


Each histogram is overlaid with a nonparametric density estimate based 
on the histogram, as well as with the true marginal posterior density. 


Each histogram also includes vertical lines showing the true distribution 
mean, the MC estimate of that mean, and the 95% CI for that mean. 


Figure 6.19 also displays the 95% CPDR estimate for θ. 

Note 1: The histogram of n-values in Figure 6.18 is itself an 
estimate of f(n|y). The short vertical lines in the histogram indicate 
the MC 95% CIs for f(n|y). 


For example, the height of the bar above 6 is the proportion of sample 
values $n_1, \ldots, n_J$ equal to 6, which is 117/500 = 0.234, and the short 
vertical bar above 6 is the MC 95% CI for P(n = 6 | y), which is 
$$\left(0.234 \pm 1.96\sqrt{0.234(1-0.234)/500}\right) = (0.1969, 0.2711).$$
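
The interval just quoted can be reproduced with a few lines of R. This is only a restatement of the arithmetic above, using the count of 117 out of J = 500 given in the text; the variable names are arbitrary.

```r
J    <- 500    # size of the thinned sample
cnt  <- 117    # number of sampled n-values equal to 6
phat <- cnt / J

# Normal-approximation 95% CI for the proportion P(n = 6 | y)
ci <- phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / J)
round(c(phat, ci), 4)
```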


Note 2: The histogram of θ-values in Figure 6.19 in fact 
shows two posterior density estimates. The first and simplest estimate 
tapers towards zero as θ approaches 1. The second estimate was 
obtained using a special mathematical device that was applied so as to 
'force' the density estimate to be relatively high near 1. For values of θ 
less than about 0.8, the two density estimates are virtually identical. 
Details of said mathematical device can be found in the R code below. 


Figure 6.13 Trace of 10,101 n-values 

Figure 6.14 Trace of 10,101 θ-values 

Figure 6.15 Trace of 500 n-values 

Figure 6.16 Trace of 500 θ-values 

Figure 6.17 Sample autocorrelation functions 

Figure 6.18 Probability histogram of 500 n-values 

Figure 6.19 Probability histogram of 500 θ-values 

R Code for Exercise 6.6 


# NB: Some of this R Code was copied from a previous exercise 

y <- 5; k <- 9; options(digits=4) 
nvec <- y:k; avec <- 1/(nvec+1); sumavec <- sum(avec); sumavec # 0.6456 
fny <- avec/sumavec; rbind(nvec,avec,fny) 
# nvec 5.0000 6.0000 7.0000 8.0000 9.0000 
# avec 0.1667 0.1429 0.1250 0.1111 0.1000 
# fny  0.2581 0.2213 0.1936 0.1721 0.1549 

nhat <- sum(nvec*fny); nhat # 6.744 
thhat <- sum( fny * (y+1)/(nvec+2) ); thhat # 0.704 
xhat <- sum( fny * nvec * (y+1)/(nvec+2) ); xhat # 4.592 
thvec <- seq(0,0.99,0.01); fthyvec <- thvec 
for(i in 1:length(thvec)) fthyvec[i] <- sum( fny * dbeta(thvec[i],y+1,nvec-y+1) ) 

X11(w=8,h=6); par(mfrow=c(2,1)) 
plot(nvec,fny,type="n",xlab="n",ylab="f(n|y)", ylim=c(0,0.4)) 
points(nvec,fny,pch=16,cex=1); abline(v=nhat) 
plot(thvec,fthyvec,type="n",xlab="theta",ylab="f(theta|y)",ylim=c(0,2.5)) 
lines(thvec,fthyvec,lwd=3); abline(v=thhat) 


# Code for Metropolis-Hastings algorithm -------------------------------------------- 
MH = function(Jp,n,th,c,r,y,k){ 
# This function performs the Metropolis-Hastings algorithm for a simple model. 
# Inputs:  Jp = total number of iterations 
#          n, th = initial values of n and theta 
#          r, c = tuning parameters for n and theta 
#          y, k = number of successes, maximum value of n 
# Outputs: $nvec = vector of (Jp+1) values of n 
#          $thvec = vector of (Jp+1) values of theta 
#          $nar, $thar = acceptance rates for n and theta. 
nvec = n; thvec = th; nct = 0; thct = 0 
logfun = function(n,th,y){ # Calculates the log of the joint posterior kernel 
  lgamma(n+1) + y*log(th) + (n-y)*log(1-th) - lgamma(n-y+1) } 
for(j in 1:Jp){ 
  nprop = sample((n-r):(n+r),1) 
  if(nprop >= y) if(nprop <= k){ 
    if(nprop == n) nct = nct + 1 
    if(nprop != n){ 
      logp1 = logfun(n=nprop,th=th,y=y) - logfun(n=n,th=th,y=y) 
      p1 = exp(logp1); u = runif(1) 
      if(u < p1){ n = nprop; nct = nct + 1 } 
    } 
  } 
  thprop = runif(1,th-c,th+c) 
  if(thprop > 0) if(thprop < 1){ 
    logp2 = logfun(n=n,th=thprop,y=y) - logfun(n=n,th=th,y=y) 
    p2 = exp(logp2); u = runif(1) 
    if(u < p2){ th = thprop; thct = thct + 1 } 
  } 
  nvec = c(nvec,n); thvec = c(thvec,th) 
} 
nar = nct/Jp; thar = thct/Jp; list(nvec=nvec,thvec=thvec,nar=nar,thar=thar) } 
# END 


X11(w=8,h=5); par(mfrow=c(1,1)) 
Jp = 10100; set.seed(135); res = MH(Jp=Jp,n=7,th=0.5,c=0.3,r=1,y=5,k=9) 
c(res$nar,res$thar) # 0.7344 0.5847 

plot(0:Jp,res$nvec,type="l",xlab="j",ylab="n_j") 
plot(0:Jp,res$thvec,type="l",xlab="j",ylab="theta_j") 

burn = 100; nvO = res$nvec[-(1:(burn+1))]; thvO = res$thvec[-(1:(burn+1))] 
nv=nvO[seq(20,10000,20)]; thv=thvO[seq(20,10000,20)]; J=500 

plot(1:J,nv,type="l",xlab="j",ylab="n_j") 
plot(1:J,thv,type="l",xlab="j",ylab="theta_j") 

par(mfrow=c(2,2)); acf(nvO); acf(thvO); acf(nv); acf(thv) 

nbar = mean(nv); nci = nbar + c(-1,1)*qnorm(0.975)*sd(nv)/sqrt(J) 
c(nbar,nci) # 6.708 6.587 6.829 
thbar = mean(thv); thci = thbar + c(-1,1)*qnorm(0.975)*sd(thv)/sqrt(J) 
thcpdr = quantile(thv,c(0.025,0.975)) 
c(thbar,thci,thcpdr) # 0.7097 0.6943 0.7252 0.3547 0.9886 


nvals=5:9; fvalszsummary(as.factor(nv)); pvals=fvals/J 
Lvals-pvals-qnorm(0.975)*sqrt(pvals*(1-pvals)/J) 
Uvals=pvals+qnorm(0.975)*sqrt(pvals*(1-pvals)/J) 


rbind(nvals,fvals,pvals,Lvals,Uvals) 

# nvals 5.0000 6.0000 7.0000 8.0000 9.0000
# fvals 128.0000 117.0000 98.0000 87.0000 70.0000
# pvals 0.2560 0.2340 0.1960 0.1740 0.1400
# Lvals 0.2177 0.1969 0.1612 0.1408 0.1096
# Uvals 0.2943 0.2711 0.2308 0.2072 0.1704

par(mfrow=c(1,1))

hist(nv,prob=T,xlim=c(4,10),ylim=c(0,0.5),xlab="n", breaks=seq(4.5,9.5,1), 
main="", ylab="density") 

points(nvec,fny,pch=16); abline(v=nhat) 

for(i in 1:length(nvals)) lines(rep(nvals[i],2),c(Lvals[i], Uvals[i]),lwd=2)

abline(v=nbar,lty=4); abline(v=nci,lty=2)

legend(8,0.5,c("True mean","Estimate of mean","95% CI for mean"),lty=c(1,4,2))

legend(4,0.5,c("True posterior"), pch=16,cex=1)

legend(4,0.4,c("95% CI for f(n | y)"),lty=1,lwd=2)


hist(thv, prob=T,xlim=c(0,1),ylim=c(0,3.2),xlab="theta", 
main="", ylab="density") 
lines(thvec,fthyvec,lwd=2); abline(v=thhat)
thdensity <- density( c(thv,1+abs(1-thv)), from=0, to=1,width=0.2)
lines(density(thv,from=0,to=1,width=0.2), lty=2,lwd=2)
# Note: This is the simplest way to estimate the density
lines(thdensity$x,thdensity$y*2,lty=3,lwd=2)
# Note: This density estimate is forced to be higher at theta=1
abline(v=thbar, lty=4); abline(v=thci,lty=2); abline(v=thcpdr,lty=3)
legend(0,3.2,c("True mean","Estimate of mean","95% CI for mean",
"95% CPDR estimate"), lty=c(1,4,2,3))
legend(0,1.6,c("True posterior","Estimate 1","Estimate 2"),lty=c(1,2,3),lwd=2) 


6.7 Independence drivers and block sampling 


The Metropolis-Hastings algorithm is very flexible and allows for a lot of 
choice in the way it is designed. In any particular application, many 
different MH algorithms will work, but some may perform better than 
others, meaning they will result in better mixing and faster convergence 
towards stochastic equilibrium. This will have a lot to do with how the 
random variables involved are set up and parameterised, what driver 
distributions are specified, and which tuning parameters are then chosen 
for completely defining those driver distributions. 


For example, the driver distribution for a component $x_m$ of the vector
$x = (x_1,...,x_M)$ may be chosen so that it depends only on the last value
of itself. In that case, $g_m(t \mid x_1, x_2, ..., x_M)$ can also be written
$g_m(t \mid x_m)$.


In fact, this is the norm in practice, and it was the case for both drivers in
the last exercise.


It is also permissible to choose the mth driver so that it doesn't depend on
any of the current values of the Markov chain, including itself. In that case,
the driver $g_m(t \mid x_1, x_2, ..., x_M)$ may be written $g_m(t)$ and be
referred to as an independence driver.
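As a minimal sketch of an independence driver (not taken from the text), the
following R code samples from an illustrative target with kernel
$k(x) = x^5(1-x)$ on (0,1), which is a Beta(6,2) density. The target, the
Beta(2,1) driver and all variable names are assumptions made for this
illustration. Because the driver is not symmetric, the acceptance ratio
includes the driver ratio $g(x)/g(x')$.

```r
# Independence-driver Metropolis-Hastings sketch (illustrative target only).
set.seed(1)
logk <- function(x) 5*log(x) + log(1 - x)   # log kernel of the Beta(6,2) target
a <- 2; b <- 1                              # driver g(t) = Beta(2,1), free of x
J <- 5000; x <- 0.5; xv <- numeric(J); ct <- 0
for (j in 1:J) {
  xp <- rbeta(1, a, b)                      # candidate ignores the current value
  # log acceptance ratio = log target ratio + log driver ratio g(x)/g(xp)
  q <- logk(xp) - logk(x) +
       dbeta(x, a, b, log = TRUE) - dbeta(xp, a, b, log = TRUE)
  if (log(runif(1)) < q) { x <- xp; ct <- ct + 1 }
  xv[j] <- x
}
ar <- ct / J                                # acceptance rate
c(ar, mean(xv))                             # compare mean(xv) with 6/8 = 0.75
```

An independence driver tends to work well only when $g(t)$ resembles the
target reasonably closely; otherwise the chain can get stuck for long runs.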


Also, one may ‘bundle’ any of the M random variables into blocks and
thereby reduce the number of actual Metropolis steps per iteration. For
example, instead of doing a Metropolis step for each of $x_3$ and $x_4$ at
each iteration, one may do a single Metropolis step as follows:


Create a candidate value of $(x_3, x_4)$ by sampling
$(x_3', x_4') \sim g_{34}(t, u \mid x_3, x_4)$ (say),
and then accept this candidate $(x_3', x_4')$ with probability

$p_{34} = \frac{f(x_1, x_2, x_3', x_4', x_5, ..., x_M)}{f(x_1, x_2, x_3, x_4, x_5, ..., x_M)} \times \frac{g_{34}(x_3, x_4 \mid x_3', x_4')}{g_{34}(x_3', x_4' \mid x_3, x_4)}$.


This idea can be used to improve mixing and speed up the rate of
convergence but may require more work sampling from the bivariate
driver and determining the optimal tuning constant. Note that to sample
$(x_3', x_4')$, it may be possible to do this in two steps via the method of
composition according to

$g_{34}(t, u \mid x_3, x_4) = g_3(t \mid x_3, x_4)\, g_4(u \mid x_3, x_4, t)$.
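The block step and the method of composition can be sketched in R as follows.
This is an illustration under stated assumptions, not code from the text: the
target is taken to be a standard bivariate normal with correlation 0.9, and
the names `rho`, `cc`, `tt` and `uu` are hypothetical. The candidate pair is
drawn in two stages, and because the resulting bivariate driver is symmetric
the driver ratio in $p_{34}$ equals 1.

```r
# Block Metropolis step for (x3, x4) with a candidate drawn by composition.
set.seed(2)
rho <- 0.9                                     # assumed target correlation
logf <- function(x3, x4) -(x3^2 - 2*rho*x3*x4 + x4^2) / (2*(1 - rho^2))
cc <- 1; J <- 5000; x3 <- 0; x4 <- 0; ct <- 0  # cc = tuning constant
out <- matrix(NA, J, 2)
for (j in 1:J) {
  tt <- rnorm(1, x3, cc)                                   # draw t from g3
  uu <- rnorm(1, x4 + rho*(tt - x3), cc*sqrt(1 - rho^2))   # draw u given t from g4
  q <- logf(tt, uu) - logf(x3, x4)             # symmetric driver: target ratio only
  if (log(runif(1)) < q) { x3 <- tt; x4 <- uu; ct <- ct + 1 }
  out[j, ] <- c(x3, x4)
}
ar <- ct / J
c(ar, cor(out[, 1], out[, 2]))
```

Here the driver is given the same correlation as the target, so proposals
move along the ridge of the density; this is one possible design choice.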


6.8 Gibbs steps and the Gibbs sampler 


One important possibility is to give the driver for $x_m$ exactly the same
distribution as the conditional distribution of $x_m$ given all the other
values.


In that case, the proposal density is
$g_m(t \mid x_1,...,x_M) = f(t \mid x_1,...,x_{m-1},x_{m+1},...,x_M)$.


With this choice, the acceptance probability equals

$p_m = \frac{f(x_1,...,x_{m-1},x_m',x_{m+1},...,x_M)}{f(x_1,...,x_{m-1},x_m,x_{m+1},...,x_M)} \times \frac{f(x_m \mid x_1,...,x_{m-1},x_{m+1},...,x_M)}{f(x_m' \mid x_1,...,x_{m-1},x_{m+1},...,x_M)}$
$= 1$ (that is, 100%).


This means that the candidate value $x_m'$ is definitely accepted at every
iteration. In that case we call the mth step of the Metropolis-Hastings
algorithm a Gibbs step.


If all the Metropolis steps are Gibbs steps then the algorithm may also be 
called a Gibbs sampler. 


Note: In the case M = 1, the Gibbs sampler equates to sampling directly 
from the distribution of interest, with no stochastic dependence between 
values of the resulting chain. 


Thus a Gibbs sampler for sampling from a multivariate distribution
$f(x) = f(x_1,...,x_M)$
may be defined as iteratively sampling from the full conditional densities:
$f(x_1 \mid x_2, x_3, ..., x_M)$
$f(x_2 \mid x_1, x_3, ..., x_M)$
...
$f(x_M \mid x_1, x_2, ..., x_{M-1})$,
where each of these is proportional to $f(x_1,...,x_M)$, for example, where

$f(x_1 \mid x_2,...,x_M) = \frac{f(x_1,...,x_M)}{f(x_2,...,x_M)} \propto f(x_1,...,x_M)$

when the right-hand side is viewed as a function of $x_1$ only.


Note: We could also write the mth conditional density as
$f(x_m \mid x_{-m})$,
where
$x_{-m} = (x_1,...,x_{m-1},x_{m+1},...,x_M)$
denotes the vector x with the mth component removed.


In any case, the mth distribution can be obtained by examining the joint
density of all the variables and seeing that joint density as a density
function of only $x_m$.
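A minimal Gibbs sampler along these lines can be sketched in R for the case
M = 2. The target here is assumed, purely for illustration, to be a standard
bivariate normal with correlation `rho`, for which the two full conditionals
are the well-known normals $N(\rho x_2, 1-\rho^2)$ and
$N(\rho x_1, 1-\rho^2)$; the example is not from the text.

```r
# Gibbs sampler sketch for an assumed standard bivariate normal target.
set.seed(3)
rho <- 0.5; J <- 5000
x1 <- 0; x2 <- 0; out <- matrix(NA, J, 2)
for (j in 1:J) {
  x1 <- rnorm(1, rho*x2, sqrt(1 - rho^2))   # draw from f(x1 | x2)
  x2 <- rnorm(1, rho*x1, sqrt(1 - rho^2))   # draw from f(x2 | x1), using new x1
  out[j, ] <- c(x1, x2)
}
colMeans(out)                # compare with the target means (0, 0)
cor(out[, 1], out[, 2])      # compare with rho
```

Note that each draw conditions on the most recent value of the other
component, and that no acceptance test is needed.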


An advantage of the Gibbs sampler is that it produces ‘good mixing’, on
account of no ‘wastage’ due to rejections. A disadvantage is that sampling
from all the required exact conditional distributions may not be easy or
even possible.


The Metropolis-Hastings algorithm is a very versatile tool that will work
in almost every situation with the least amount of mathematical effort.
The Gibbs sampler performs better but is practically feasible only in some
special cases.


A general recommendation in any given situation is to begin by specifying 
a ‘pure’ Metropolis-Hastings algorithm, and then to examine each of its 
M Metropolis steps with a view to converting it into a Gibbs step if that is 
not too much effort. If the resulting Metropolis-Hastings algorithm 
consists of at least one Gibbs step and at least one Metropolis step, it may 
also be referred to as a Metropolis-Hastings within Gibbs sampler. 


Example 


As an example of converting a Metropolis step into a Gibbs step, recall
the joint posterior density in Exercise 6.5:

$f(n,\theta \mid y) \propto \frac{n!\,\theta^y (1-\theta)^{n-y}}{(n-y)!}$, $0 < \theta < 1$, $n = y, y+1, ..., k$.


This density was used as a basis for the following Metropolis step for
$\theta$ at each iteration:

2. Propose a value $\theta' \sim U(\theta - c, \theta + c)$ and accept this
value with probability
$p_2 = \frac{\theta'^y (1-\theta')^{n-y}}{\theta^y (1-\theta)^{n-y}}$.


Instead of this Metropolis step at each iteration, it would be better and also
easier to apply a Gibbs step which involves sampling the next value of
$\theta$ directly from the Beta(y+1, n-y+1) distribution.


Equivalently, one could write that Gibbs step as:
2. Draw $\theta \sim$ Beta(y+1, n-y+1).

Now consider the Metropolis step for n in Exercise 6.5:

1. Propose a value $n' \sim DU(n-r,...,n+r)$, and accept this value
with probability
$p_1 = \frac{f(n',\theta \mid y)}{f(n,\theta \mid y)} = \frac{n'!\,(1-\theta)^{n'}/(n'-y)!}{n!\,(1-\theta)^{n}/(n-y)!}$.


Unfortunately, the kernel of $f(n,\theta \mid y)$ when seen as a function of n
alone (i.e. $n!(1-\theta)^n/(n-y)!$) does not suggest a well-known
distribution.


However, with a little effort, it is still possible to convert the Metropolis
step for n into a Gibbs step, as follows:

1. Calculate $q(n) = n!(1-\theta)^n/(n-y)!$ for each $n = 5,...,9$.
Calculate $q_T = q(5) + ... + q(9)$.
Hence obtain $f(n \mid y,\theta) = q(n)/q_T$.
Draw $n \sim f(n \mid y,\theta)$ (now easy).
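The two Gibbs steps just described can be sketched in R as follows. This is
an illustrative sketch rather than the book's own code; it assumes y = 5 and
k = 9 as in Exercise 6.5, and the variable names (`nvals`, `logq`, and so on)
are hypothetical. The weights q(n) are computed on the log scale for
numerical stability.

```r
# Gibbs sampler sketch for (n, theta), assuming y = 5 and k = 9.
set.seed(4)
y <- 5; k <- 9; J <- 5000
n <- 7; th <- 0.5                       # starting values
nv <- numeric(J); thv <- numeric(J)
nvals <- y:k                            # support of n
for (j in 1:J) {
  # Step 1: draw n from its discrete full conditional q(n)/sum(q)
  logq <- lgamma(nvals + 1) + nvals*log(1 - th) - lgamma(nvals - y + 1)
  q <- exp(logq - max(logq))            # rescaled to avoid underflow
  n <- sample(nvals, 1, prob = q / sum(q))
  # Step 2: draw theta from Beta(y+1, n-y+1)
  th <- rbeta(1, y + 1, n - y + 1)
  nv[j] <- n; thv[j] <- th
}
table(nv) / J                           # estimated marginal posterior of n
c(mean(nv), mean(thv))
```

As a rough check, integrating $\theta$ out of the kernel above gives
$f(n \mid y) \propto 1/(n+1)$ on $n = 5,...,9$, so the relative frequencies
from `table(nv)/J` should decrease gently in n.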


Exercise 6.7 Sampling from a normal-normal-gamma model via 
MCMC 


Consider the general normal-normal-gamma model given by:
$(y_1,...,y_n \mid \mu, \lambda) \sim$ iid $N(\mu, 1/\lambda)$
$(\mu \mid \lambda) \sim N(\mu_0, \sigma_0^2)$
$\lambda \sim G(\alpha, \beta)$.


Suppose that $\mu_0 = 10$, $\sigma_0 = 2$, $\alpha = 3$, $\beta = 6$ and $n = 40$.
(a) Generate $y = (y_1,...,y_n)$ from the model using these constants.


(b) Design a suitable Metropolis-Hastings algorithm in this setting. Then
apply it to y in (a) so as to generate a random sample of size J = 5,000 from
the bivariate posterior distribution of $\mu$ and $\lambda$. Illustrate the
sample with appropriate trace plots and probability histograms.


(c) Repeat (b) but with a Gibbs sampler in place of the MH algorithm.


Solution to Exercise 6.7 


(a) Using the specified values, we generated the parameters
$\lambda = 0.1292$ and $\mu = 11.95$
from their independent prior distributions.


We then generated n = 40 values from the $N(\mu, \sigma^2)$ distribution with
$\sigma = 1/\sqrt{\lambda} = 2.782$. The sample mean and standard deviation of
these values were 12.28 and 2.592. A histogram of the sample values is shown
in Figure 6.20. Overlaid is the $N(\mu, \sigma^2)$ density.


Figure 6.20 Probability histogram of 40 y-values


(b) The joint posterior density of $\mu$ and $\lambda$ is

$f(\mu,\lambda \mid y) \propto f(\lambda)\, f(\mu \mid \lambda)\, f(y \mid \mu,\lambda)$

$\propto \lambda^{\alpha-1} e^{-\beta\lambda} \times \exp\left\{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right\} \times \prod_{i=1}^n \lambda^{1/2} \exp\left\{-\frac{\lambda}{2}(y_i-\mu)^2\right\}$

$= \lambda^{\alpha+n/2-1} \exp\left\{-\beta\lambda - \frac{(\mu-\mu_0)^2}{2\sigma_0^2} - \frac{\lambda}{2}\sum_{i=1}^n (y_i-\mu)^2\right\}$

$\equiv k(\mu,\lambda)$.
A suitable MH algorithm is then defined by the following two steps: 


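1. Draw a value $\mu' \sim U(\mu - c, \mu + c)$
and accept it with probability $p_1 = \frac{k(\mu', \lambda)}{k(\mu, \lambda)}$.

2. Draw a value $\lambda' \sim U(\lambda - r, \lambda + r)$
and accept it with probability $p_2 = \frac{k(\mu, \lambda')}{k(\mu, \lambda)}$.


Note: The best way to calculate the acceptance probabilities is as:
$p_1 = \exp(q_1)$ and $p_2 = \exp(q_2)$,
after first deriving $q_1 = l(\mu', \lambda) - l(\mu, \lambda)$ and
$q_2 = l(\mu, \lambda') - l(\mu, \lambda)$,
where

$l(\mu, \lambda) = \log k(\mu, \lambda) = \left(\alpha + \frac{n}{2} - 1\right)\log\lambda - \beta\lambda - \frac{(\mu-\mu_0)^2}{2\sigma_0^2} - \frac{\lambda}{2}\sum_{i=1}^n (y_i-\mu)^2$.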
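To see why working on the log scale is advisable, consider the following
hedged R sketch. It is not part of the original solution: the data are
simulated afresh with an assumed sample size of 2,000 (much larger than the
exercise's n = 40) so that the kernel itself underflows, and the evaluation
points (10, 0.25) and (10.1, 0.25) are arbitrary.

```r
# Illustration: ratio of kernels versus exp of a difference of log kernels.
set.seed(5)
y <- rnorm(2000, 10, 2)                       # assumed, illustrative data
alp <- 3; bet <- 6; mu0 <- 10; sig0 <- 2; n <- length(y)
l <- function(mu, lam) (alp + n/2 - 1)*log(lam) - bet*lam -
  0.5*(mu - mu0)^2/sig0^2 - 0.5*lam*sum((y - mu)^2)
naive <- exp(l(10.1, 0.25)) / exp(l(10, 0.25))  # both kernels underflow to 0
p1 <- exp(l(10.1, 0.25) - l(10, 0.25))          # stable: exp of the difference
c(naive, p1)
```

The naive ratio is 0/0 and so is not a number, whereas the log-scale version
returns a usable acceptance ratio.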


The MH algorithm was started at $\mu = 0$ and $\lambda = 1$ with tuning
constants c = 0.1 and r = 0.01, and run for a total of 6,000 iterations. The
resulting traces are shown in Figures 6.21 and 6.22.


The acceptance rates for $\mu$ and $\lambda$ were 92% and 92%. These rates
were judged to be unduly high because they led to very strong serial
correlation in the simulated values (i.e. poor mixing).


So the algorithm was run again from the same starting values but with
c = 0.9 and r = 0.08 (both larger). This resulted in Figures 6.23 and 6.24
(pages 312 and 313), with much better mixing, faster convergence, and
the better acceptance rates of 59% and 58%.


The last 5,000 pairs of values from this second run of the algorithm were 
then collected and used to produce the two histograms in Figures 6.25 and 
6.26 (pages 313 and 314). Each histogram is overlaid by a density estimate 
of the corresponding posterior and shows a dot indicating the true value 
of the parameter (which was initially sampled from its prior). 


Figure 6.21 Trace for $\mu$

Figure 6.22 Trace for $\lambda$

Figure 6.23 Improved trace for $\mu$

Figure 6.24 Improved trace for $\lambda$

Figure 6.25 Histogram for $\mu$

Figure 6.26 Histogram for $\lambda$


(c) Examining the kernel of the joint posterior in (b) and studying previous
exercises (involving the normal-normal model and the normal-gamma 
model) we easily identify the two conditional distributions which define 
the Gibbs sampler. These are defined as follows: 
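1. Sample $\mu \sim f(\mu \mid y, \lambda) \sim N(\mu_*, \sigma_*^2)$, where:
$\mu_* = (1-k)\mu_0 + k\bar{y}$,
$\sigma_*^2 = \frac{k\sigma^2}{n} = \frac{k}{n\lambda}$,
$k = \frac{n}{n + \sigma^2/\sigma_0^2} = \frac{n}{n + 1/(\lambda\sigma_0^2)}$.

2. Sample $\lambda \sim f(\lambda \mid y, \mu) \sim G\left(\alpha + \frac{n}{2},\; \beta + \frac{1}{2}\left[(n-1)s^2 + n(\mu - \bar{y})^2\right]\right)$.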
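The gamma conditional in step 2 can be checked against the kernel
$k(\mu,\lambda)$ in (b). The short derivation below is a reconstruction from
that kernel and from the R code for part (c), not a quotation of the original
working; $s^2$ denotes the sample variance of the $y_i$.

```latex
\begin{align*}
f(\lambda \mid y,\mu)
  &\propto \lambda^{\alpha+n/2-1}
     \exp\Big\{-\lambda\Big[\beta+\tfrac12\sum_{i=1}^n (y_i-\mu)^2\Big]\Big\}, \\
\sum_{i=1}^n (y_i-\mu)^2
  &= \sum_{i=1}^n (y_i-\bar y)^2 + n(\bar y-\mu)^2
   = (n-1)s^2 + n(\mu-\bar y)^2 .
\end{align*}
```

The cross term vanishes because $\sum_{i=1}^n (y_i - \bar{y}) = 0$, which is
why the sampler only needs $\bar{y}$ and $s^2$ rather than the full data
vector at each iteration.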


This Gibbs sampler was started at $\mu = 0$ and $\lambda = 1$ and run for a
total of 6,000 iterations. The resulting traces are shown in Figures 6.27 and
6.28.


The last 5,000 pairs of values were then collected and used to produce the 
histograms in Figures 6.29 and 6.30 (page 316). Each histogram is 
overlaid by a density estimate of the corresponding posterior and shows a 
dot indicating the true value of the parameter. 


We see that the Gibbs sampler has produced very similar output to that in
(b) as obtained using the Metropolis-Hastings algorithm, but with less
effort (e.g. no need to worry about tuning constants) and with arguably
better results.


By this we mean that the output from the Gibbs sampler exhibits far less
serial correlation. This is evidenced clearly in Figure 6.31 (page 317),
which shows the sample autocorrelation functions of the simulated values
of $\mu$ and $\lambda$ in (b) (top two subplots) and in (c) (bottom two
subplots).


Figure 6.27 Trace for $\mu$ from Gibbs sampler

Figure 6.28 Trace for $\lambda$ from Gibbs sampler

Figure 6.29 Histogram for $\mu$ from Gibbs sampler

Figure 6.30 Histogram for $\lambda$ from Gibbs sampler

Figure 6.31 Sample autocorrelations
[Four ACF panels plotted against Lag (0 to 35): Series muvb and Series lamvb
(top row), Series muvc and Series lamvc (bottom row)]


R Code for Exercise 6.7 


# (a) 
mu0=10; sig0=2; alp=3; bet=6; n=40; options(digits=4)
set.seed(226); lam=rgamma(1,alp,bet); mu=rnorm(1,mu0,sig0);
sig=1/sqrt(lam); y=rnorm(n,mu,sig)
c(lam, sig, sig^2, mu, mean(y), sd(y))
# 0.1292 2.7822 7.7405 11.9511 12.2768 2.5919


X11(w=8,h=5); par(mfrow=c(1,1))


hist(y, prob=T,xlim=c(5,20),ylim=c(0,0.25),breaks=seq(7,17,0.5), main=" ")
yv=seq(0,20,0.01); lines(yv, dnorm(yv,mu,sig),lwd=3)


# (b)

MH <- function(Jp, mu, lam, y, c, r, alp=0, bet=0, mu0=0, sig0=10000 ){
# This function implements a Metropolis-Hastings algorithm for the general
# normal-normal-gamma model.
# Inputs: Jp = total number of iterations
# mu, lam = starting values of mu and lambda
# y = vector of n observations
# c, r = tuning parameters for mu and lambda
# alp, bet = parameters of lambda's gamma prior (mean = alp/bet)
# mu0, sig0 = mean and standard deviation of mu's normal prior
# Outputs: $muv, $lamv = (Jp+1)-vectors of values of mu and lambda
# $muar, $lamar = acceptance rates for mu and lambda.

muv <- mu; lamv <- lam; ybar <- mean(y); n <- length(y); muct <- 0; lamct <- 0
logpost <- function(n,y,mu,lam,alp,bet,mu0,sig0){
 (alp + n/2 - 1)*log(lam) - bet*lam -
 0.5*lam*sum((y-mu)^2) - 0.5*(mu-mu0)^2/sig0^2 }
for(j in 1:Jp){
 mup <- runif(1,mu-c,mu+c) # propose a value of mu
 q1 <-
 logpost(n=n,y=y,mu=mup,lam=lam,alp=alp,bet=bet,mu0=mu0,sig0=sig0)-
 logpost(n=n,y=y,mu=mu ,lam=lam,alp=alp,bet=bet,mu0=mu0,sig0=sig0)
 p1 <- exp(q1) # acceptance probability
 u <- runif(1); if(u < p1){ mu <- mup; muct <- muct + 1 }
 lamp <- runif(1,lam-r,lam+r) # propose a value of lambda
 if(lamp > 0){ # automatically reject if lamp < 0
 q2 <-
 logpost(n=n,y=y,mu=mu,lam=lamp,alp=alp,bet=bet,mu0=mu0,sig0=sig0)-
 logpost(n=n,y=y,mu=mu,lam=lam ,alp=alp,bet=bet,mu0=mu0,sig0=sig0)
 p2 <- exp(q2) # acceptance probability
 u <- runif(1); if(u < p2){ lam <- lamp; lamct <- lamct + 1 }
 }
 muv <- c(muv,mu); lamv <- c(lamv,lam)
}
list(muv=muv,lamv=lamv,muar=muct/Jp,lamar=lamct/Jp)
}


Jp <- 6000; set.seed(331) 

res <- MH(Jp=Jp, mu=0,lam=1, y=y, c=0.1,r=0.01, alp=3,bet=6,
mu0=10,sig0=2)
c(res$muar,res$lamar) # 0.9193 0.9165


plot(0:Jp,res$muv,type="l",xlab="j",ylab="mu_j"); text(3000,6,"c=0.1, r=0.01")
plot(0:Jp,res$lamv,type="l",xlab="j",ylab="lambda_j");
text(3000, 0.6,"c=0.1, r=0.01")

res <- MH(Jp=Jp, mu=0,lam=1, y=y, c=0.9,r=0.08, alp=3,bet=6,
mu0=10,sig0=2)
c(res$muar,res$lamar) # 0.5890 0.5757


plot(0:Jp,res$muv,type="l",xlab="j",ylab="mu_j"); text(3000,6,"c=0.9, r=0.08")
plot(0:Jp,res$lamv,type="l",xlab="j",ylab="lambda_j");
text(3000,0.6,"c=0.9, r=0.08")


burn <- 1000; muv <- res$muv[-(1:(burn+1))]; lamv <- res$lamv[-(1:(burn+1))]
hist(muv,prob=T,xlab="mu",nclass=20,main="",
ylab="density/relative frequency"); lines(density(muv),lwd=2);
points(mu,0,pch=16,cex=1.5)
hist(lamv, prob=T,xlab="lambda",nclass=20,main="",
ylab="density/relative frequency"); lines(density(lamv),lwd=2)
points(lam,0,pch=16,cex=1.5)


# acf(muv)$acf[1:5] # 1.0000 0.6452 0.4175 0.2744 0.1770
# acf(lamv)$acf[1:5] # 1.0000 0.6641 0.4535 0.3300 0.2419
muvb=muv; lamvb=lamv # For use later


# (c) 

GS = function(Jp, mu, lam, y, alp=0, bet=0, mu0=0, sig0=10000 ){
# This function implements a Gibbs Sampler for the general
# normal-normal-gamma model.
# Inputs: Jp = total number of iterations
# mu, lam = starting values of mu and lambda
# y = vector of n observations
# alp, bet = parameters of lambda's gamma prior (mean = alp/bet)
# mu0, sig0 = mean and standard deviation of mu's normal prior
# Outputs: $muv, $lamv = (Jp+1)-vectors of values of mu and lambda

muv = mu; lamv = lam; n = length(y); ybar = mean(y); s2 = var(y); sig02 = sig0^2
for(j in 1:Jp){
 sig2=1/lam; k=n/(n+sig2/sig02); sig2star=k*sig2/n;
 mustar=(1-k)*mu0+k*ybar
 mu = rnorm(1,mustar,sqrt(sig2star))
 lam=rgamma( 1, alp+0.5*n, bet+0.5*((n-1)*s2+n*(mu-ybar)^2) )
 muv = c(muv,mu); lamv=c(lamv,lam) }
list(muv=muv,lamv=lamv)


}

Jp = 6000; set.seed(331)
res = GS(Jp=Jp, mu=0,lam=1, y=y, alp=3,bet=6, mu0=10,sig0=2)


plot(0:Jp,res$muv,type="l",xlab="j",ylab="mu_j");
plot(0:Jp,res$lamv,type="l",xlab="j",ylab="lambda_j");


burn <- 1000; muv <- res$muv[-(1:(burn+1))]; lamv <- res$lamv[-(1:(burn+1))]


hist(muv,prob=T,xlab="mu",nclass=20,main="",ylim=c(0,1.1),
ylab="density/relative frequency"); lines(density(muv),lwd=2);
points(mu,0,pch=16,cex=1.5)


hist(lamv, prob=T,xlab="lambda",nclass=20,main="",
ylab="density/relative frequency"); lines(density(lamv),lwd=2)
points(lam,0,pch=16,cex=1.5)

muvc=muv; lamvc=lamv


X11(w=8,h=7); par(mfrow=c(2,2))


acf(muvb)$acf[1:5] # 1.0000 0.6452 0.4175 0.2744 0.1770
acf(lamvb)$acf[1:5] # 1.0000 0.6641 0.4535 0.3300 0.2419


acf(muvc)$acf[1:5] # 1.0000000 -0.0004031 0.0079520 -0.0073517 0.0135979
acf(lamvc)$acf[1:5] # 1.000000 0.002873 -0.011504 -0.006671 -0.001769


7.1 Introduction


In the last chapter we introduced a set of very powerful tools for 
generating samples required for Bayesian Monte Carlo inference, namely 
Markov chain Monte Carlo (MCMC) methods. The topics we covered 
included the Metropolis algorithm, the Metropolis-Hastings algorithm and
the Gibbs sampler. 


We now present one more topic, stochastic data augmentation, and 
provide some further exercises in MCMC. These exercises will illustrate 
how many statistical problems can be cast in the Bayesian framework and
how easily inference can then proceed relative to the classical framework. 


The examples below include simple linear regression, logistic regression 
(an example of generalised linear modelling and survival analysis), 
autocorrelated Bernoulli data, and inference on the unknown bounds of a 
uniform distribution. 


7.2 Data augmentation 


Data augmentation (DA) is a method for using unobserved data or latent 
variables so as to simplify and facilitate an iterative optimisation or 
sampling algorithm. There are two basic types of DA: deterministic DA 
and stochastic DA. An example of the former is the EM algorithm as 
described earlier. Stochastic DA is illustrated in the following example. 


Example of stochastic data augmentation 


Suppose we wish to sample from a univariate distribution defined by a
density $f(x)$ but that this is difficult to do directly. Suppose also
that we can factor this density as

$f(x) \propto g(x)h(x)$,

where:

$g(x) = \int q(u \mid x)\,du$

$q(u \mid x)$ is the kernel of a conditional density for a latent
random variable u given x which is easy to sample from

$q(u \mid x)h(x)$ defines the kernel of a conditional density for x
given u which is easy to sample from; call this kernel $k(x \mid u)$.


In such a situation we may define the joint distribution of u and x by the
density

$f(u, x) \propto q(u \mid x)h(x)$.


Then, since both of the conditional distributions (of u given x, and of x
given u) are easy to sample from, we may define a suitable Gibbs sampler
by the following two steps:

(i) Sample $u' \sim q(u \mid x)$

(ii) Sample $x' \sim k(x \mid u')$.


Running this Gibbs sampler will eventually result in a random sample
$(u_1, x_1), ..., (u_J, x_J) \sim$ iid $f(u, x)$.
Discarding the simulated latent variables $u_1,...,u_J$ then yields the
desired sample,
$x_1, ..., x_J \sim$ iid $f(x)$.


This idea can be extended in a straightforward fashion to sampling from 
a multivariate distribution, i.e. where x is a vector. In such cases, it may 
be necessary to define several latent variables in the fashion described 
above. 


Exercise 7.1 Sampling with the aid of stochastic data 
augmentation 


We wish to find the mean of a random variable with density

$f(x) \propto \frac{e^{-x}}{x+1}$, $x > 0$.


(a) Calculate the exact value of EX using numerical integration techniques. 


(b) Estimate EX using a Monte Carlo sample obtained via rejection
sampling.


(c) Estimate EX using a Monte Carlo sample obtained via the Metropolis
algorithm.


(d) Estimate EX using a Monte Carlo sample obtained via a Gibbs sampler 
designed using the principles of data augmentation. 


Note 1: We have already seen the above density f(x) in the context 
of a previous exercise. 


Note 2: The intent of this exercise is threefold: 


(i) to illustrate stochastic data augmentation 

(ii) to provide additional practice at several Monte Carlo techniques 

(iii) to introduce an idea that will be useful later when attempting finite 
population inference under biased sampling without 
replacement. 


Solution to Exercise 7.1 


(a) Let the kernel be $k(x) = \frac{e^{-x}}{x+1}$.


Then, using the integrate() function in R, we obtain

$\int_0^\infty k(x)\,dx = 0.59635$ and $\int_0^\infty x\,k(x)\,dx = 0.40365$.


So EX = 0.40365/0.59635 = 0.6769.


(b) A suitable envelope is the standard exponential density
$h(x) = e^{-x}$, $x > 0$,
for which the acceptance probability is

$p(x) = \frac{k(x)}{c\,h(x)}$,

where

$c = \max_x \frac{k(x)}{h(x)} = \max_x \frac{1}{x+1} = 1$.

Thus $p(x) = \frac{1}{x+1}$.


Applying this algorithm we obtained a random sample of size J = 1,000
using a total of 1,651 draws from the envelope. (Thus the acceptance rate
was 1,000/1,651 = 61%.) Using this Monte Carlo sample, we estimated
EX as 0.6875 with 95% CI (0.6402, 0.7349).


Figure 7.1 shows a trace plot of the simulated values and (just for interest) 


the associated sample ACF of these values (showing the complete absence 
of autocorrelation), respectively. 


Figure 7.1 Trace plot and sample ACF


(c) Using a normal driver distribution centred at the last value and with
standard deviation 0.6 we ran a Metropolis algorithm for 40,500 iterations,
starting at x = 1. We kept every 40th sampled value after first discarding
the first 500 iterations as the burn-in. Using the resulting Monte Carlo
sample of size 1,000, we estimated EX as 0.7049 with 95% CI (0.6561,
0.7537). The overall acceptance rate of the algorithm was 58%. Figure 7.2
shows a trace plot of all 40,500 simulated values, the sample ACF of those
values (showing a very strong autocorrelation), a trace plot of the 1,000
values used for inference, and the sample ACF of those values (showing
very little autocorrelation).


Figure 7.2 Trace plots and sample ACFs


(d) Observe that $\frac{1}{x+1} = \int_0^\infty e^{-(x+1)w}\,dw$.


Therefore $f(x) \propto \frac{1}{x+1}e^{-x} \propto \int_0^\infty e^{-(x+1)w} e^{-x}\,dw$.


Hence we may define an artificial latent variable w such that the joint
density of w and x is

$f(w,x) \propto e^{-(x+1)w} e^{-x}$, $w > 0$, $x > 0$.


We see that:

$f(w \mid x) \propto f(w,x) \propto e^{-(x+1)w}$, $w > 0$

$f(x \mid w) \propto f(w,x) \propto e^{-(w+1)x}$, $x > 0$.

So, a Gibbs sampler for sampling from $f(w,x)$ is defined by the two
densities:

$f(w \mid x) = (x+1)e^{-(x+1)w}$, $w > 0$

$f(x \mid w) = (w+1)e^{-(w+1)x}$, $x > 0$,

or equivalently by the two steps:

Sample $w \sim$ Gamma(1, x+1)

Sample $x \sim$ Gamma(1, w+1).


Starting at x = 1, we ran this Gibbs sampler for 5,100 iterations. We then
kept every 5th sampled value after first discarding the first 100 iterations
as the burn-in. Using the resulting Monte Carlo sample of size 1,000 we
estimated EX as 0.7172 with 95% CI (0.6671, 0.7673).


Figure 7.3 shows a trace plot of all 5,100 simulated values, their sample 
ACF (showing a slight autocorrelation), a trace plot of the 1,000 values 
used for inference, and the sample ACF of these 1,000 values (showing 
very little autocorrelation). 


Note that similar plots could also be produced for the simulated latent 
variable, w. Also note how data augmentation and a Gibbs sampler have 
resulted in a usable Monte Carlo sample more easily and effectively than 
the Metropolis algorithm.


Figure 7.3 Trace plots and sample ACFs


R Code for Exercise 7.1


# (a) 

options(digits=5); kfun=function(x){ exp(-x)/(x+1) }
c=integrate(f=kfun,lower=0,upper=Inf)$value; c # 0.59635
xkfun =function(x){ x*exp(-x)/(x+1) }
top=integrate(f=xkfun,lower=0,upper=Inf)$value; top # 0.40365
EX=top/c; EX # 0.67688


# (b)
J=1000; xv=rep(NA,J); ct=0; set.seed(331)
for(j in 1:J){ acc=F; while(acc==F){ ct=ct+1
x=rgamma(1,1,1); p=1/(x+1); u=runif(1); if(u<p){ acc=T; xv[j]=x } } }
xbar=mean(xv); ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J)
c(ct,xbar,ci) # 1651.00000 0.68754 0.64016 0.73492
par(mfrow=c(2,1)); plot(1:J,xv,type="l")
acf(xv)$acf[1:5] # 1.0000000 -0.0205516 -0.0100987 -0.0040018 0.0732520


# (c)

MET <- function(K,x,c){
# This function applies the Metropolis algorithm to sampling from
# f(x)~exp(-x)/(x+1),x>0.
# Inputs: K = total number of iterations
# x = initial value of x, c = standard deviation of normal driver
# Outputs: $xv = vector of (K+1) values of x, $ar = acceptance rate
xv = x; ct = 0
for(j in 1:K){
 xp = rnorm(1,x,c)
 if(xp > 0){
 q = (-xp-log(xp+1)) - (-x-log(x+1)); p = exp(q); u = runif(1)
 if(u < p){ x = xp; ct = ct+1 }
 }
 xv <- c(xv,x) }
ar = ct/K; list(xv=xv,ar=ar) }


K=40500; set.seed(298); res <- MET(K=K,x=1,c=0.6); res$ar # 0.53896
par(mfrow=c(2,2)); plot(0:K, res$xv,type="l")
acf(res$xv)$acf[1:5] # 1.00000 0.91458 0.83710 0.76808 0.70716
xv=res$xv[-(1:501)][seq(40,40000,40)]; plot(1:J,xv,type="l")
acf(xv)$acf[1:5] # 1.0000000 0.0727149 -0.0088327 0.0265807 0.0592275
xbar=mean(xv); ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J)
c(xbar,ci) # 0.70491 0.65614 0.75368


# (d) 
GIBBS <- function(K,x){
# This generates a sample using the Gibbs sampler and data augmentation.
# Inputs: K = total number of iterations, x = initial value of x
# Outputs: $xv = vector of (K+1) values of x, $wv = vector of (K+1) values of w
xv = x; wv = NA; for(j in 1:K){
 w=rgamma(1,1,x+1); x=rgamma(1,1,w+1); xv=c(xv,x); wv=c(wv,w) }
list(xv=xv,wv=wv) }


K=5100; set.seed(319); res <- GIBBS(K=K,x=1)
par(mfrow=c(2,2)); plot(0:K, res$xv,type="l")
acf(res$xv)$acf[1:5] # 1.0000000 0.0692628 0.0407747 0.0053119 -0.0133717
xv=res$xv[-(1:101)][seq(5,5000,5)]; plot(1:J,xv,type="l")
acf(xv)$acf[1:5]
# 1.0000e+00 -2.4435e-02 4.5681e-02 -3.1778e-02 2.7116e-05
xbar=mean(xv); ci=xbar + c(-1,1)*qnorm(0.975)*sd(xv)/sqrt(J)
c(xbar,ci) # 0.71720 0.66711 0.76729


Exercise 7.2 Comparison of classical and Bayesian simple linear
regression (and practice at various statistical techniques) 


Consider the following simple linear regression model: 
$Y_i \overset{\perp}{\sim} N(\mu_i, \sigma^2)$, $i = 1,...,n$,

where
$\mu_i = a + b x_i$

(linear predictor for a value with covariate $x_i$).


(a) Generate a data vector $y = (y_1,...,y_n)$ from the model, using:
$n = 10$, $a = 5$, $b = 2$, $\sigma = 2$,
and with covariates
$x_i = i$
for all $i = 1,...,n$.


(b) Conduct a classical analysis of the data in (a). Report the MLEs and
95% CIs for a and b. Also create a single graph which shows:

• the data values

• the true regression line $E(Y \mid x) = a + bx$

• the fitted regression line $\hat{E}(Y \mid x) = \hat{a} + \hat{b}x$

• two lines showing the 95% CI for the regression line

• two lines showing the 95% prediction interval at each value of x.
(c) Perform a Bayesian analogue of the inference in (b) using the 
Metropolis-Hastings algorithm and a Monte Carlo sample of size 


J = 2,000. 


Use a suitable joint uninformative and improper prior for the three 
parameters in the model. 


(d) Create a single graph showing all the information in the two graphs in 
(b) and (c). 


Note: The Bayesian analysis in (c) could also be performed via the 
Gibbs sampler.


Solution to Exercise 7.2


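(a) The simulated data are shown in Table 7.1. Note that $x_i = i$.


Table 7.1 Simulated data

i      1      2      3      4      5
y_i    5.879  8.54   14.12  13.14  15.26

i      6      7      8      9      10
y_i    20.43  19.92  18.47  21.63  24.11


(b) The MLE of b is

$\hat{b} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = 1.836$,

and the MLE of a is then $\hat{a} = \bar{y} - \hat{b}\bar{x} = 6.051$.


An unbiased estimate of $\sigma^2$ ($= 1/\lambda = 4$) is

$s^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \{\hat{a} + \hat{b}x_i\})^2 = 3.816$.


Let: $X = \begin{pmatrix} 1 & 1 \\ \vdots & \vdots \\ 1 & n \end{pmatrix}$,
$M = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} = (X'X)^{-1}$.

A 95% CI for a is then

$(\hat{a} \pm t_{0.025}(8)\, s\sqrt{m_{11}}) = (2.973, 9.128)$,

and a 95% CI for b is

$(\hat{b} \pm t_{0.025}(8)\, s\sqrt{m_{22}}) = (1.340, 2.332)$.


Also, a 95% CI for $E(Y \mid x) = a + bx$ is

$\left( (\hat{a} + \hat{b}x) \pm t_{0.025}(8)\, s\sqrt{(1 \;\; x)\, M\, (1 \;\; x)'} \right)$,

and a 95% prediction interval for a new observation Y with covariate x is

$\left( (\hat{a} + \hat{b}x) \pm t_{0.025}(8)\, s\sqrt{1 + (1 \;\; x)\, M\, (1 \;\; x)'} \right)$.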


The required graph is shown in Figure 7.4. 
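As a cross-check on the interval formulas above, the following R sketch
compares them with the intervals returned by predict() for a fitted lm
object. It is an illustration rather than part of the original solution: the
data are regenerated inside the block using the exercise's constants, and the
covariate value `x0 = 5.5` and the names `h`, `ci` and `pint` are
hypothetical.

```r
# Manual CI and prediction interval at x0, checked against predict().
set.seed(123)
n <- 10; xdat <- 1:n; ydat <- rnorm(n, 5 + 2*xdat, 2)
fit <- lm(ydat ~ xdat)
s2 <- sum(resid(fit)^2) / (n - 2)          # unbiased estimate of sigma^2
M <- summary(fit)$cov.unscaled             # (X'X)^{-1}
x0 <- 5.5; v <- c(1, x0)                   # assumed covariate value
fitted0 <- sum(v * coef(fit))              # ahat + bhat*x0
h <- drop(t(v) %*% M %*% v)                # (1 x0) M (1 x0)'
tq <- qt(0.975, n - 2)
ci   <- fitted0 + c(-1, 1) * tq * sqrt(s2 * h)        # CI for E(Y | x0)
pint <- fitted0 + c(-1, 1) * tq * sqrt(s2 * (1 + h))  # prediction interval
new <- data.frame(xdat = x0)
pc <- predict(fit, new, interval = "confidence")
pp <- predict(fit, new, interval = "prediction")
rbind(ci, pc[1, c("lwr", "upr")], pint, pp[1, c("lwr", "upr")])
```

The manual and predict() rows should agree, which confirms that the
prediction interval differs from the CI only through the extra "1 +" term
under the square root.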


Figure 7.4 Classical inference
[Plot of y against x. Legend: True mean of Y given x; Least squares fit;
95% CI for mean; 95% prediction interval]

(c) A suitable Bayesian model is given by: 
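$(Y_i \mid a, b, \lambda) \overset{\perp}{\sim} N(a + b x_i, 1/\lambda)$, $i = 1,...,n$

$f(a, b, \lambda) \propto 1/\lambda$, $a, b \in \mathbb{R}$, $\lambda > 0$ (where $\lambda = 1/\sigma^2$).

Let us now solve this Bayesian model so as to estimate the posterior means
and 95% CPDRs for a and b. The joint posterior density of the three model
parameters is

$f(a, b, \lambda \mid y) \propto \frac{1}{\lambda} \prod_{i=1}^n \lambda^{1/2} \exp\left\{-\frac{\lambda}{2}(y_i - \mu_i)^2\right\}$

(where $\mu_i = a + b x_i$, as already defined).


Hence the joint log-posterior density (up to an additive constant) is

$l(a, b, \lambda) = \left(\frac{n}{2} - 1\right)\log\lambda - \frac{\lambda}{2}\sum_{i=1}^n (y_i - \mu_i)^2$.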


Applying the MH algorithm for 2,500 iterations, we obtain traces for the 
three parameters as shown in Figure 7.5. The horizontal lines show the 
true values of the three parameters. The fourth subplot (bottom right) is a 
histogram of the last 2,000 values of b simulated. 


Figure 7.5 Results of a MH algorithm


Using output from the last 2,000 iterations only, we estimate the posterior
mean and 95% CPDR for a (= 5) as 6.3445 and (3.578, 8.808), and the
same for b (= 2) as about 1.7881 and (1.392, 2.234).


Figure 7.6 shows the Bayesian analogue of Figure 7.4 in part (b).


Figure 7.6 Bayesian inference


(d) The required graph is shown in Figure 7.7.


Figure 7.7 Comparison of inferences


R Code for Exercise 7.2


# (a) ************************************************************
options(digits=4)

n <- 10; a <- 5; b <- 2; lam <- 0.25; sig <- 1/sqrt(lam); c(sig,sig^2) # 2 4
xdat <- 1:n; set.seed(123); ydat <- rnorm(n,a+b*xdat,sig)
rbind(xdat,ydat)
# xdat 1.000 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00
# ydat 5.879 8.54 14.12 13.14 15.26 20.43 19.92 18.47 21.63 24.11


# (b) ****************************************************
fit <- lm(ydat ~ xdat); summary(fit)
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)    6.051      1.335    4.53   0.0019 **
# xdat           1.836      0.215    8.54  2.7e-05 ***

ahat <- coef(fit)[[1]]; bhat <- coef(fit)[[2]]
sse <- sum((ydat-(ahat+bhat*xdat))^2)
sig2hat <- sse/(n-2); lamhat <- 1/sig2hat
c(sse,sig2hat,lamhat) # 30.532 3.816 0.262

df <- length(ydat)-length(fit$coef)
aCI <- ahat + c(-1,1)*qt(0.975,df)*sqrt(sig2hat*summary(fit)$cov.unscaled[1,1])
aCI # 2.973 9.128
bCI <- bhat + c(-1,1)*qt(0.975,df)*sqrt(sig2hat*summary(fit)$cov.unscaled[2,2])
bCI # 1.340 2.332

xxv <- seq(0,n,0.1); nn <- length(xxv)
Xmat <- cbind(1,xxv)
muhat <- Xmat %*% fit$coef
muhatvar <- sig2hat * diag(Xmat %*% summary(fit)$cov.unscaled %*% t(Xmat))
df <- length(ydat)-length(fit$coef)
muhatlb <- muhat - qt(0.975,df) * sqrt(muhatvar)
muhatub <- muhat + qt(0.975,df) * sqrt(muhatvar)

predlb <- muhat - qt(0.975,df) * sqrt(sig2hat+muhatvar)
predub <- muhat + qt(0.975,df) * sqrt(sig2hat+muhatvar)

X11(w=8,h=5); par(mfrow=c(1,1)) # Figure
plot(xdat,ydat,pch=16,xlim=c(0,11),ylim=c(0,35),xlab="x",ylab="y")
abline(c(a,b),lwd=2);
lines(c(0,n),c(fit$coef[1],fit$coef[1]+fit$coef[2]*n),lty=4,lwd=2)
lines(xxv,muhatlb,lty=3,lwd=2)
lines(xxv,muhatub,lty=3,lwd=2)
lines(xxv,predlb,lty=2,lwd=2)
lines(xxv,predub,lty=2,lwd=2)
legend(6,12,c("True mean of Y given x","Least squares fit","95% CI for mean",
 "95% prediction interval"),lty=c(1,4,3,2),lwd=rep(2,4))


# (c) ****************************************************
MH.SLR <- function(Jp, x, y, a, b, lam, asd, bsd, lamsd){
# This function implements a Metropolis-Hastings algorithm for a
# simple linear regression model with uninformative priors.
# Inputs: Jp = total number of iterations
#         x = vector of covariates
#         y = vector of observations
#         a,b,lam = starting values of a,b,lambda
#         asd,bsd,lamsd = st. dev.s of drivers for a,b,lambda.
# Outputs: $av,$bv,$lamv = (Jp+1)-vectors of values of a,b,lambda
#          $aar,$bar,$lamar = acceptance rates for a,b,lambda.
av <- a; bv <- b; lamv <- lam; ybar <- mean(y); n <- length(y)
act <- 0; bct <- 0; lamct <- 0
logpost <- function(n, x, y, a, b, lam){ # log-posterior
 (n/2 - 1) * log(lam) - 0.5 * lam * sum((y-a-b*x)^2) }
for(j in 1:Jp){
 ap <- rnorm(1, a, asd) # propose a value of a
 k <- logpost(n=n, x=x, y=y, a=ap, b=b, lam=lam) -
      logpost(n=n, x=x, y=y, a=a, b=b, lam=lam)
 p <- exp(k) # acceptance probability
 u <- runif(1); if(u<p){ a <- ap; act <- act + 1 }
 bp <- rnorm(1, b, bsd) # propose a value of b
 k <- logpost(n=n, x=x, y=y, a=a, b=bp, lam=lam) -
      logpost(n=n, x=x, y=y, a=a, b=b, lam=lam)
 p <- exp(k) # acceptance probability
 u <- runif(1); if(u<p){ b <- bp; bct <- bct + 1 }
 lamp <- rnorm(1, lam, lamsd) # propose a value of lambda
 if(lamp > 0){ # automatically reject if lamp < 0
  k <- logpost(n=n, x=x, y=y, a=a, b=b, lam=lamp) -
       logpost(n=n, x=x, y=y, a=a, b=b, lam=lam)
  p <- exp(k) # acceptance probability
  u <- runif(1); if(u<p){ lam <- lamp; lamct <- lamct + 1 }
 }
 av <- c(av, a); bv <- c(bv, b); lamv <- c(lamv, lam)
}
list(av = av, bv = bv, lamv = lamv, aar = act/Jp, bar = bct/Jp, lamar = lamct/Jp)
}

Jp <- 2500; set.seed(441)
mh <- MH.SLR(Jp=Jp, x=xdat, y=ydat, a=0, b=0, lam=1,
 asd=1.2, bsd=0.2, lamsd=0.2)
c(mh$aar,mh$bar,mh$lamar) # 0.5228 0.5008 0.5132

X11(w=8,h=6); par(mfrow=c(2,2)) # Figure
plot(0:Jp,mh$av,xlab="j",ylab="a_j",type="l"); abline(h=a)
plot(0:Jp,mh$bv,xlab="j",ylab="b_j",type="l"); abline(h=b)
plot(0:Jp,mh$lamv,xlab="j",ylab="lambda_j",type="l"); abline(h=lam)
hist(mh$bv[-(1:501)],main=" ",xlab="b")

burn <- 500; J <- Jp - burn; J # 2000
av <- mh$av[-c(1:(burn+1))]; abar <- mean(av)
bv <- mh$bv[-c(1:(burn+1))]; bbar <- mean(bv)
lamv <- mh$lamv[-c(1:(burn+1))]; lambar <- mean(lamv)

sig2bar <- mean(1/lamv)
c(abar,bbar,lambar,sig2bar) # 6.3445 1.7881 0.2758 4.7505

quantile(av,c(0.025,0.975)) # 3.578 8.808
quantile(bv,c(0.025,0.975)) # 1.392 2.234

cpdrLBs <- xxv; cpdrUBs <- xxv; predLBs <- xxv; predUBs <- xxv; set.seed(171)
for(i in 1:nn){
 mus <- av + bv*xxv[i]
 cpdrLBs[i] <- quantile(mus,0.025)
 cpdrUBs[i] <- quantile(mus,0.975)
 sim <- rnorm(J,mus,1/sqrt(lamv))
 predLBs[i] <- quantile(sim,0.025)
 predUBs[i] <- quantile(sim,0.975)
}

X11(w=8,h=5); par(mfrow=c(1,1)) # Figure
plot(xdat,ydat,pch=16,xlim=c(0,11),ylim=c(0,35),xlab="x",ylab="y")
abline(c(a,b),lwd=2); lines(c(0,n),c(abar,abar+bbar*n),lty=4,lwd=2);
lines(xxv,cpdrLBs,lty=3,lwd=2)
lines(xxv,cpdrUBs,lty=3,lwd=2)
lines(xxv,predLBs,lty=2,lwd=2)
lines(xxv,predUBs,lty=2,lwd=2)
legend(6,12,c("True mean of Y given x","Posterior mean of mean",
 "95% CPDR for mean","95% prediction interval"),lty=c(1,4,3,2),lwd=rep(2,4))

X11(w=8,h=5); par(mfrow=c(1,1)) # Figure
plot(xdat,ydat,pch=16,xlim=c(0,11),ylim=c(0,35),xlab="x",ylab="y")
abline(c(a,b),lwd=2) # True regression line
# Classical lines
lines(c(0,n),c(fit$coef[1],fit$coef[1]+fit$coef[2]*n),lty=2,lwd=2)
lines(xxv,muhatlb,lty=2,lwd=2); lines(xxv,muhatub,lty=2,lwd=2)
lines(xxv,predlb,lty=2,lwd=2); lines(xxv,predub,lty=2,lwd=2)
# Bayesian lines
lines(c(0,n),c(abar,abar+n*bbar),lty=4,lwd=1)
lines(xxv,cpdrLBs,lty=4,lwd=1); lines(xxv,cpdrUBs,lty=4,lwd=1)
lines(xxv,predLBs,lty=4,lwd=1); lines(xxv,predUBs,lty=4,lwd=1)

legend(6,12,c("True mean of Y given x",
 "Classical inference","Bayesian inference"),lty=c(1,2,4),lwd=c(2,2,1))


Exercise 7.3 Comparison of classical and Bayesian logistic 
regression (an example of GLMs) (and practice at various 
statistical techniques) 


Table 7.2 shows data on the number of rats who died in each of n = 10 
experiments within one month of being administered a particular dose of 
radiation. For example in Experiment 3, a total of 40 rats were exposed to 
radiation for 3.6 hours, and 23 of them died within one month. Thus an 
estimate of the probability of a rat dying within one month if it is exposed 
to 3.6 hours of radiation is 23/40 = 57.5%. 


Table 7.2 Rat mortality data 


 i    n_i    x_i    y_i    y_i/n_i = p̂_i

 1    10     0.1     1     1/10  = 0.1
 2    30     1.4     0     0/30  = 0
 3    40     3.6    23     23/40 = 0.575
 4    20     3.8    12     12/20 = 0.6
 5    15     5.2     8     8/15  = 0.5333
 6    46     6.1    32     32/46 = 0.696
 7    12     8.7    10     10/12 = 0.833
 8    37     9.1    35     35/37 = 0.946
 9    23     9.1    19     19/23 = 0.826
10     8    13.6     8     8/8   = 1

Consider the following logistic regression model for these data:

$$Y_i \sim \perp \mathrm{Bin}(n_i, p_i), \quad i = 1, \ldots, n,$$

where:

$$p_i = \frac{1}{1 + \exp(-z_i)} \quad \text{(probability of a `success' for experiment } i\text{)}$$
$$z_i = a + b x_i \quad \text{(linear predictor)}.$$


(a) Find the ML estimates of a and b using the glm() function in R. For 
each parameter also calculate a suitable 95% CI. 


(b) Find the ML estimates and associated 95% CIs in R using your own 
code for the Newton-Raphson algorithm and without using the glm() 
function. 


(c) Find the ML estimates using a modification of the Newton-Raphson 
algorithm which does not require the inversion of matrices. 


(d) Suppose that a and b are assigned independent flat priors over the
whole real line. Thus consider the Bayesian model:

$$(Y_i \mid a, b) \sim \perp \mathrm{Bin}(n_i, p_i), \quad i = 1, \ldots, n$$
$$p_i = \frac{1}{1 + \exp(-z_i)} \quad \text{(probability of death for experiment } i\text{)}$$
$$z_i = a + b x_i \quad \text{(linear predictor)}$$
$$f(a, b) \propto 1, \quad a, b \in \mathbb{R}.$$

Use the MH algorithm to get a sample of J = 10,000 observations from
$f(a, b \mid y)$, where $y = (y_1, \ldots, y_n)$.


Hence estimate the posterior means of a and b, together with 95% MC CIs
for these estimates, and also estimate the 95% CPDRs.


Show graphs of the traces and histograms. Overlay the MC estimates and
MLEs over the traces, together with 95% CPDRs and CIs, respectively.
Also, overlay kernel density estimates over the histograms.


(e) Use the sample in (d) to estimate p(x), the probability of a rat dying if
it is exposed to x hours of radiation, for each x = 0,1,2,...,15.


Graph these results with a line in a figure which also shows the 10 p̂_i
values.

Also include:
* the MC 95% CI for each estimate of p(x) (i.e. for each E{p(x) | y})
* the MC 95% CPDR for each p(x)
* the MLE of each p(x) using standard GLM procedures,
together with associated large-sample 95% CIs.


(f) Suppose that 20 more rats are about to be exposed to exactly five hours
of radiation. Use the sample in (d) to estimate how many of these 20 rats
will die, together with a 95% CI for your estimate. Also construct an
approximate 95% prediction region for the number of rats that will die
and report the estimated actual probability content of this region.


(g) Use the sample in (d) to estimate LD50, the lethal dose of radiation at
which 50% of rats die, together with a 95% CPDR. Also compute an
estimate and 95% CI for LD50 using standard GLM techniques.


(h) Consider the Bayesian model and data in (d). Modify the model
suitably so as to constrain the probability of death at a dose of zero to be
exactly zero. Estimate the parameters in the new model and draw a graph
similar to the one in (e) which shows the posterior probability of death for
each dose x from zero to 15, together with the associated 95% CPDRs.


Solution to Exercise 7.3 


(a) Using the glm() function in R, we find that the MLE and 95% CI for
a are -2.156 and (-2.9998, -1.3113). Also, the MLE and 95% CI for b are
0.5028 and (0.3456, 0.6601).


(b) Since the priors on a and b are flat, finding the maximum likelihood
estimate of (a,b) is the same as finding the posterior mode of (a,b). Now,
the posterior density of a and b is

$$f(a, b \mid y) \propto \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{n_i - y_i}.$$

So the log-posterior (up to an additive constant) is

$$l(a, b) = \log f(a, b \mid y) = \sum_{i=1}^{n} q_i,$$

where

$$q_i = y_i \log p_i + (n_i - y_i) \log(1 - p_i) = y_i z_i - n_i \log(1 + \exp(z_i)) \quad \text{(after some algebra)}.$$
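
For readers who want the algebra behind the simplified form of q_i, the step can be sketched as follows, using only the definitions p_i = 1/(1 + exp(-z_i)) and z_i = a + b x_i given above:

```latex
\begin{aligned}
\log p_i &= \log\frac{e^{z_i}}{1+e^{z_i}} = z_i - \log(1+e^{z_i}), \\
\log(1-p_i) &= \log\frac{1}{1+e^{z_i}} = -\log(1+e^{z_i}), \\
q_i &= y_i\{z_i - \log(1+e^{z_i})\} - (n_i - y_i)\log(1+e^{z_i}) \\
    &= y_i z_i - n_i \log(1+e^{z_i}).
\end{aligned}
```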
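
Let

$$d_{1i} = \frac{\partial q_i}{\partial a} = y_i - n_i p_i, \qquad d_{2i} = \frac{\partial q_i}{\partial b} = (y_i - n_i p_i) x_i,$$
$$d_{11i} = \frac{\partial^2 q_i}{\partial a^2} = -n_i p_i (1 - p_i), \qquad d_{12i} = \frac{\partial^2 q_i}{\partial a\, \partial b} = -n_i p_i (1 - p_i) x_i,$$
$$d_{22i} = \frac{\partial^2 q_i}{\partial b^2} = -n_i p_i (1 - p_i) x_i^2,$$
$$d_1 = \sum_{i=1}^{n} d_{1i}, \quad d_2 = \sum_{i=1}^{n} d_{2i}, \quad d_{11} = \sum_{i=1}^{n} d_{11i}, \quad d_{12} = \sum_{i=1}^{n} d_{12i}, \quad d_{22} = \sum_{i=1}^{n} d_{22i},$$
$$v = \begin{pmatrix} a \\ b \end{pmatrix}, \quad D = D(v) = \begin{pmatrix} d_1 \\ d_2 \end{pmatrix}, \quad M = M(v) = \begin{pmatrix} d_{11} & d_{12} \\ d_{12} & d_{22} \end{pmatrix}.$$

Then the NR algorithm is defined by

$$v_t = v_{t-1} - M(v_{t-1})^{-1} D(v_{t-1}), \quad t = 1, 2, 3, \ldots$$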


Starting from the origin, the iterates of a and b are as shown in Table 7.3. 


Table 7.3 Results of a Newton-Raphson algorithm 


t      1        2        3        4        5
a_t   -1.474   -2.013   -2.148   -2.156   -2.156
b_t    0.3369   0.4670   0.5008   0.5028   0.5028


Thus the MLEs of a and b are â = -2.156 and b̂ = 0.5028. This agrees
perfectly with the results in (a).


A 95% CI for a is $(\hat{a} \pm t_{0.025}(8)\, s_a)$ and a 95% CI for b is $(\hat{b} \pm t_{0.025}(8)\, s_b)$,

where:

$$t_{0.025}(8) = 2.306$$
$$s_a^2 \text{ is the top left element of } V$$
$$s_b^2 \text{ is the bottom right element of } V$$
$$V = \phi (X'WX)^{-1} \quad \text{(a 2 by 2 matrix)}$$
$$\phi = 1, \quad X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad W = \mathrm{diag}(w_1, \ldots, w_n), \quad w_i = \frac{n_i}{V(\hat{p}_i)\, g'(\hat{p}_i)^2}$$
$$\hat{p}_i = \frac{1}{1 + \exp(-\hat{z}_i)} \quad \text{(MLE of the probability at } x = x_i\text{)}$$
$$\hat{z}_i = \hat{a} + \hat{b} x_i \quad \text{(MLE of linear predictor at } x = x_i\text{)}$$
$$V(u) = u(1 - u), \quad g(u) = \log\frac{u}{1 - u} \quad \text{(logit link function)}$$
$$g'(u) = \frac{1}{u(1 - u)}.$$

We find that $w_i = n_i \hat{p}_i (1 - \hat{p}_i)$. Numerically, we find that 95% CIs for a
and b are (-3.000, -1.311) and (0.3456, 0.6601), respectively. These
results agree with those in (a).


(c) At each iteration t = 1,2,3,4,..., we:


1. Fix b and perform a NR step towards maximising wrt a:
   $a_{t+1} = a_t - d_1(a_t) / d_{11}(a_t)$


2. Fix a and perform a NR step towards maximising wrt b:
   $b_{t+1} = b_t - d_2(b_t) / d_{22}(b_t).$


Starting from the origin (a, b) = (0,0) we obtain the results in Table 7.4.


Table 7.4 Results of a search algorithm


t      0    1         2          3          4
a_t    0    0.4564   -0.45034   -0.06132   -0.7294
b_t    0    0.1401    0.09223    0.20571    0.1690

t      20        21        99        100
a_t   -1.8585   -1.8619   -2.1555   -2.1555
b_t    0.4424    0.4532    0.5028    0.5028


We see that this modified and simpler algorithm converges more slowly
than plain NR. Also, it is less stable, as it fails to converge if started from
(a, b) = (0.3, 0.3), unlike plain NR. Both algorithms fail to converge if
started from (0.5, 0.5). (See the R code below for details.)

(d) We apply the Metropolis-Hastings algorithm with a burn-in of 500 and
starting from the origin to get a sample of size J = 10,000 from
f(a, b | y). The acceptance rates were 37% for a and 55% for b. The
Markov chain was not thinned for subsequent inference, meaning that the
CIs obtained below are perhaps narrower than they should be.
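
One common way to gauge how much narrower is to compare the naive standard error with a batch-means standard error, which allows for autocorrelation. The following is only a sketch under stated assumptions: the helper `batch_means_se` is hypothetical (it is not part of the book's code), and a simulated AR(1) series stands in for an actual MH trace.

```r
# Hypothetical helper (not from the book): batch-means standard error of the
# mean of a possibly autocorrelated chain x, using nbatch equal-sized batches.
batch_means_se <- function(x, nbatch = 20) {
  m <- floor(length(x) / nbatch)             # batch length
  x <- x[seq_len(m * nbatch)]                # drop any leftover values
  bmeans <- colMeans(matrix(x, nrow = m))    # one mean per batch (column)
  sd(bmeans) / sqrt(nbatch)
}

# Toy usage on a simulated AR(1) series standing in for an MH trace.
set.seed(2)
n <- 10000
chain <- as.numeric(arima.sim(list(ar = 0.9), n = n))
se_naive <- sd(chain) / sqrt(n)              # treats the draws as independent
se_batch <- batch_means_se(chain)
c(se_naive, se_batch)
```

For a strongly autocorrelated chain the batch-means value should be noticeably larger than the naive one, which is the sense in which the unthinned MC CIs are likely to be too narrow.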


The MC estimate of E(a | y) is -2.207 (similar to the MLE, -2.156), with
95% CI (-2.214, -2.199) and 95% CPDR (-2.963, -1.521).


The MC estimate of E(b | y) is 0.5145 (similar to the MLE, 0.5028), with
95% CI (0.5132, 0.5158) and 95% CPDR (0.3895, 0.6605).


Traces and histograms of the sampled values of a and b are shown in 
Figure 7.8. 


Figure 7.8 Results of MH algorithm 

(e) The required results are shown in Figure 7.9.


Note: Figure 7.9 shows that the probability of a rat dying when given no
radiation is about 10%. We should interpret this result and the graph
near x = 0 with caution. Ideally, we would conduct another experiment
with only small values of x and a second logistic regression, perhaps
using the log of x as the explanatory variable. On the other hand, maybe
the 10% figure is reasonable because rats could die within one month
for reasons other than radiation. Alternatively, we could modify our
model so as to force p(0) = 0 (see (h) below).


Figure 7.9 Mortality rate estimates 

(f) Let d be the number of rats which will die if exposed to radiation for
five hours. Then

$$(d \mid y, a, b) \sim \mathrm{Bin}(20, p(a, b)),$$

where

$$p(a, b) = 1/(1 + \exp(-a - 5b)).$$


We can now apply the method of composition whereby

$$f(d, a, b \mid y) = f(d \mid y, a, b)\, f(a, b \mid y).$$


Thus for each sampled (a,b) we calculate p(a,b) and sample from the
binomial distribution of d above. The frequencies of the resulting 10,000
values of d are shown in Table 7.5.

Table 7.5 Simulated frequencies of rats dying


d             3      4      5      6      7      8
frequency     1      3     20     75    217    472

d             9     10     11     12     13     14
frequency   845   1188   1562   1733*  1546   1123

d            15     16     17     18     19
frequency   709    332    131     37      6


Using the 10,000 values of d, our estimate of d is 11.81 (the average of
the 10,000 values), with (11.76, 11.85) as the 95% MC CI for d's posterior
mean. We feel about 95.1% confident that the number of rats which die
will be between 8 and 16, inclusive (since 95.1% of the simulated d values
are in this range). Also, it is most likely that 12 of the 20 rats will die,
because the MC estimate of Mode(d | y) is 12 (since d = 12 above has the
highest frequency, namely 1,733, as marked by an asterisk).
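
As a quick cross-check, the point estimate, the content of the region from 8 to 16, and the mode can all be recomputed directly from the frequencies in Table 7.5. This is a minimal sketch; the variable names are illustrative and not taken from the book's code.

```r
# Frequencies copied from Table 7.5 (simulated numbers of deaths out of 20).
d    <- 3:19
freq <- c(1, 3, 20, 75, 217, 472, 845, 1188, 1562, 1733, 1546, 1123,
          709, 332, 131, 37, 6)
J <- sum(freq)                                   # total number of simulated values
dhat  <- sum(d * freq) / J                       # MC estimate of E(d | y)
cover <- sum(freq[d >= 8 & d <= 16]) / J         # content of the region 8..16
dmode <- d[which.max(freq)]                      # MC estimate of Mode(d | y)
c(J, dhat, cover, dmode)
```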


(g) First observe that the LD50 is the value of x such that p(x) = 0.5.
Solving 1/(1 + exp(-a - bx)) = 0.5, we get x = LD50 = -a/b.


Using the sample of 10,000 from part (d), we estimate the posterior mean of
LD50 as 4.279, with 95% MC CI (4.273, 4.286). The MC 95% CPDR for
LD50 is (3.584, 4.916). Thus we can be 95% confident that the dose
required to kill half of a large number of rats is between 3.6 and 4.9.


Using standard GLM procedures and the delta method we estimate LD50
as 4.287 (the MLE) with 95% CI (3.532, 5.042). Thus we can be 95%
confident that the dose required to kill half of a large number of rats is
between 3.5 and 5.0. We see that Bayesian and classical methods have
resulted in inferences which are very similar.
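
The delta-method calculation can be sketched as follows. The numerical inputs are the rounded MLEs and unscaled covariance matrix reported for the glm() fit in the R code for this exercise; the variable names are illustrative, and the use of 8 degrees of freedom follows the author's choice elsewhere in the solution.

```r
# Rounded values copied from the glm() output for this exercise.
a_hat <- -2.1555; b_hat <- 0.5028
V <- matrix(c(0.13404, -0.022442,
              -0.022442, 0.004651), nrow = 2)    # estimated covariance of (a_hat, b_hat)

ld50 <- -a_hat / b_hat                           # MLE of LD50 = -a/b
grad <- c(-1 / b_hat, a_hat / b_hat^2)           # gradient of -a/b wrt (a, b)
ld50_var <- drop(t(grad) %*% V %*% grad)         # delta-method variance
ld50_ci  <- ld50 + c(-1, 1) * qt(0.975, 8) * sqrt(ld50_var)
c(ld50, ld50_ci)                                 # approx. 4.287 3.532 5.042
```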


(h) An alternative to the logistic model in (d), one with zero probability
of death at zero dosage of radiation, is as follows:

$$(Y_i \mid a, b) \sim \perp \mathrm{Bin}(n_i, p_i), \quad i = 1, \ldots, n$$
$$p_i = 1 - \exp(-z_i), \quad z_i = a x_i + b x_i^2$$
$$f(a, b) \propto 1, \quad a, b > 0.$$

Running a suitable modification of the MH algorithm in (d), we estimate
a and b as 0.11 and 0.017, with respective 95% CPDRs (0.04, 0.20) and
(0.004, 0.030). The required graph is shown in Figure 7.10.


Figure 7.10 Modified mortality rate estimates 

R Code for Exercise 7.3


# (a) ****************************************************

nvec <- c(10,30,40,20,15,46,12,37,23,8)
xvec <- c(0.1,1.4,3.6,3.8,5.2,6.1,8.7,9.1,9.1,13.6)
yvec <- c(1,0,23,12,8,32,10,35,19,8)
pvec <- yvec/nvec

options(digits=4)
cbind(xvec,nvec,yvec,pvec)
#       xvec nvec yvec   pvec
#  [1,]  0.1   10    1 0.1000
#  [2,]  1.4   30    0 0.0000
#  [3,]  3.6   40   23 0.5750
#  [4,]  3.8   20   12 0.6000
#  [5,]  5.2   15    8 0.5333
#  [6,]  6.1   46   32 0.6957
#  [7,]  8.7   12   10 0.8333
#  [8,]  9.1   37   35 0.9459
#  [9,]  9.1   23   19 0.8261
# [10,] 13.6    8    8 1.0000

fit <- glm(pvec~xvec,family=binomial(link=logit),weights=nvec)
fit$coef # -2.1555 0.5028
summary(fit)$cov.unscaled
#             (Intercept)      xvec
# (Intercept)     0.13404 -0.022442
# xvec           -0.02244  0.004651

alpse <- sqrt(summary(fit)$cov.unscaled[1,1])
fitalpci <- fit$coef[1] + c(-1,1)*qt(0.975,8)*alpse
c(alpse,fitalpci) # 0.3661 -2.9998 -1.3113

betse <- sqrt(summary(fit)$cov.unscaled[2,2])
fitbetci <- fit$coef[2] + c(-1,1)*qt(0.975,8)*betse
c(betse,fitbetci) # 0.0682 0.3456 0.6601


NR.LOGISTIC <- function(m,alp,bet,xv,nv,yv){
# Performs logistic regression via the Newton-Raphson algorithm.
# Inputs: m = number of iterations
#         alp, bet = starting values of alpha and beta
#         xv, nv, yv = vectors of covariates, sample sizes and
#                      numbers of successes, respectively.
# Outputs: $alpv = vector of (m+1) alpha values
#          $betv = vector of (m+1) beta values
alpv <- alp; betv <- bet; ve <- c(alp,bet)
for(t in 1:m){
 pv <- 1/(1+exp(-alp-bet*xv))
 d1 <- sum(yv - nv*pv); d2 <- sum((yv - nv*pv)*xv)
 d11 <- -sum(nv*pv*(1-pv)); d12 <- -sum(nv*pv*(1-pv)*xv)
 d22 <- -sum(nv*pv*(1-pv)*xv^2)
 D <- c(d1,d2)
 M <- matrix(c(d11,d12,d12,d22),nrow=2)
 ve <- ve - solve(M) %*% D
 alp <- ve[1]; bet <- ve[2]
 alpv <- c(alpv,alp); betv <- c(betv,bet)
}
list(alpv=alpv, betv=betv)
}

options(digits=4)
nrres <- NR.LOGISTIC(m=20,alp=0,bet=0,xv=xvec,nv=nvec,yv=yvec)
nrres
# $alpv: [1] 0.000 -1.474 -2.013 -2.148 -2.156 -2.156 ....
# $betv: [1] 0.0000 0.3369 0.4670 0.5008 0.5028 0.5028 ....

NR.LOGISTIC(m=20,alp=0.3,bet=0.3,xv=xvec,nv=nvec,yv=yvec)
# $alpv: [1] 0.000 -1.474 -2.013 -2.148 -2.156 -2.156 ....
# $betv: [1] 0.0000 0.3369 0.4670 0.5008 0.5028 0.5028 ....

NR.LOGISTIC(m=20,alp=0.5,bet=0.5,xv=xvec,nv=nvec,yv=yvec)
# Error in solve.default(M) :
#   system is computationally singular: reciprocal condition
#   number = 9.01649e-18

alpmle <- nrres$alpv[21]; betmle <- nrres$betv[21]
X <- cbind(1,xvec)
zmle <- alpmle + betmle*xvec # linear predictor
pmle <- 1/(1 + exp(-zmle))
wtvec <- nvec*pmle*(1-pmle)
W <- diag(wtvec)
varmat <- solve(t(X) %*% W %*% X)

varmat
#   0.13404 -0.022442
#  -0.02244  0.004651

qt(0.975,8) # 2.306
alpmle + c(-1,1)*qt(0.975,8)*sqrt(varmat[1,1]) # -3.000 -1.311
betmle + c(-1,1)*qt(0.975,8)*sqrt(varmat[2,2]) # 0.3456 0.6601


# (c) ****************************************************

NRMOD.LOGISTIC <- function(m,alp,bet,xv,nv,yv){
# Performs logistic regression via a modification of the Newton-Raphson
# algorithm.
# Inputs: m = number of iterations
#         alp, bet = starting values of alpha and beta
#         xv, nv, yv = vectors of covariates, sample sizes and
#                      numbers of successes, respectively.
# Outputs: $alpv = vector of (m+1) alpha values
#          $betv = vector of (m+1) beta values
alpv <- alp; betv <- bet; ve <- c(alp,bet)
for(t in 1:m){
 pv <- 1/(1+exp(-alp-bet*xv))
 d1 <- sum(yv - nv*pv)
 d2 <- sum((yv - nv*pv)*xv)
 d11 <- -sum(nv*pv*(1-pv))
 d22 <- -sum(nv*pv*(1-pv)*xv^2)
 alp <- alp - d1/d11
 bet <- bet - d2/d22
 alpv <- c(alpv,alp); betv <- c(betv,bet)
}
list(alpv=alpv, betv=betv)
}


resnr <- NRMOD.LOGISTIC(m=100,alp=0,bet=0,xv=xvec,nv=nvec,yv=yvec)
inc <- c(1,2,3,4,5,21,22,100,101); rbind(inc-1,resnr$alpv[inc],resnr$betv[inc])
# [1,] 0 1.0000  2.00000  3.00000  4.0000 20.0000 21.0000 99.0000 100.0000
# [2,] 0 0.4564 -0.45034 -0.06132 -0.7294 -1.8585 -1.8619 -2.1555  -2.1555
# [3,] 0 0.1401  0.09223  0.20571  0.1690  0.4424  0.4532  0.5028   0.5028

resnr <- NRMOD.LOGISTIC(m=100,alp=0.3,bet=0.3,xv=xvec,nv=nvec,yv=yvec)
rbind(inc-1,resnr$alpv[inc],resnr$betv[inc])
# [1,] 0.0  1.00000 2.0000   3.00 4.000e+00  20  21  99 100
# [2,] 0.3 -1.72625 2.1776 -31.10 4.023e+15 NaN NaN NaN NaN
# [3,] 0.3 -0.01407 0.6942 -21.36 2.861e+18 NaN NaN NaN NaN

resnr <- NRMOD.LOGISTIC(m=100,alp=0.5,bet=0.5,xv=xvec,nv=nvec,yv=yvec)
rbind(inc-1,resnr$alpv[inc],resnr$betv[inc])
# [1,] 0.0  1.000    2.0    3   4  20  21  99 100
# [2,] 0.5 -4.532  828.1 -Inf NaN NaN NaN NaN NaN
# [3,] 0.5 -1.090 3101.9 -Inf NaN NaN NaN NaN NaN

xvdata <- c(0.1,1.4,3.6,3.8,5.2,6.1,8.7,9.1,9.1,13.6)
yvdata <- c(1,0,23,12,8,32,10,35,19,8)
nvdata <- c(10,30,40,20,15,46,12,37,23,8)
pvdata <- yvdata/nvdata


MHLR <- function(burn,J,a0,b0,xv,yv,nv,sa,sb){
# Performs the Metropolis-Hastings algorithm for a logistic regression model.
# Inputs: burn = number of iterations for burn-in
#         J = required number of Monte Carlo simulations
#         a0 = starting value of alpha
#         b0 = starting value of beta
#         xv = vector of xi values (length n)
#         yv = vector of yi observations
#         nv = vector of ni values
#         sa, sb = standard deviations of the two normal driver fns.
# Outputs: $av = vector of (burn+J+1) values of alpha (incl. starting value)
#          $bv = vector of (burn+J+1) values of beta (incl. starting value)
#          $ara = acceptance rate for alpha (over last J iterations)
#          $arb = acceptance rate for beta.

logfun <- function(a,b,xv,yv,nv){
 phatv <- 1/(1+exp(-a-b*xv))
 sum( yv*log(phatv) + (nv-yv)*log(1-phatv) )
}

n <- length(yv); a <- a0; b <- b0
its <- burn + J # total number of iterations
av <- c(a, rep(NA,its));
bv <- c(b, rep(NA,its)) # vectors of simulated a & b values
arav <- c(NA, rep(0,its)); arbv <- c(NA, rep(0,its))
# acceptance rate vectors for a and b

for(j in 1:its){
 a2 <- rnorm(1,a,sa)
 logpr <- logfun(a=a2,b=b,xv=xv,yv=yv,nv=nv)-
          logfun(a=a,b=b,xv=xv,yv=yv,nv=nv)
 pr <- exp(logpr); u <- runif(1)
 if(u<pr){ a <- a2; arav[j+1] <- 1 }

 b2 <- rnorm(1,b,sb)
 logpr <- logfun(a=a,b=b2,xv=xv,yv=yv,nv=nv)-
          logfun(a=a,b=b,xv=xv,yv=yv,nv=nv)
 pr <- exp(logpr); u <- runif(1)
 if(u<pr){ b <- b2; arbv[j+1] <- 1 }

 av[j+1] <- a; bv[j+1] <- b
}

ara <- sum(arav[(burn+2):(its+1)])/J
arb <- sum(arbv[(burn+2):(its+1)])/J # acceptance rates for a & b

list(av=av,bv=bv,ara=ara,arb=arb)
}

burn <- 500; K <- 10000; its <- burn + K; set.seed(221); date() #
res <- MHLR(burn=burn,J=K,a0=0,b0=0,xv=xvdata,
 yv=yvdata,nv=nvdata,sa=0.5,sb=0.05); date() # 10000 Took 1 second
c(res$ara,res$arb) # 0.3650 0.5544
par(mfrow=c(2,1)); plot(res$av,type="l"); plot(res$bv,type="l") # OK

options(digits=4); J <- K; thin <- 1
# thin=1 means no thinning (for experimentation)
av <- res$av[-(1:(burn+1))][seq(thin,K,thin)]; length(av) # 10000
acf(av)$acf[1:5] # 1.0000 0.9283 0.8756 0.8324 0.7945
# (very high autocorrelation)
ahat <- mean(av); aci <- ahat + c(-1,1) * qnorm(1-0.05/2)*sqrt(var(av)/J)
acpdr <- quantile(av,c(0.025,0.975))
c(ahat,aci,acpdr) # -2.207 -2.214 -2.199 -2.963 -1.521

bv <- res$bv[-(1:(burn+1))][seq(thin,K,thin)]; length(bv) # 10000
acf(bv)$acf[1:5] # 1.0000 0.9363 0.8892 0.8481 0.8109
bhat <- mean(bv); bci <- bhat + c(-1,1) * qnorm(1-0.05/2)*sqrt(var(bv)/J)
bcpdr <- quantile(bv,c(0.025,0.975))
c(bhat,bci,bcpdr) # 0.5145 0.5132 0.5158 0.3895 0.6605

dena <- density(av); denb <- density(bv)
fit <- glm(pvdata~xvdata, family=binomial(link=logit), weights=nvdata)
fit$coef # -2.1555 0.5028

ase <- sqrt(summary(fit)$cov.unscaled[1,1])
fitaci <- fit$coef[1] + c(-1,1)*qt(0.975,8)*ase
c(ase,fitaci) # 0.3661 -2.9998 -1.3113

bse <- sqrt(summary(fit)$cov.unscaled[2,2])
fitbci <- fit$coef[2] + c(-1,1)*qt(0.975,8)*bse
c(bse,fitbci) # 0.0682 0.3456 0.6601

X11(w=8,h=8); par(mfrow=c(2,2))
plot(0:its,res$av,type="l",xlab="j",ylab="a_j")
abline(h=c(ahat,aci,acpdr))
abline(h=c(fit$coef[1],fitaci),lty=4)
legend(400,0,c("MC est, 95% CI & CPDR",
 "MLE & classical 95% CI"),lty=c(1,4))
plot(0:its,res$bv,type="l",xlab="j",ylab="b_j")
abline(h=c(bhat,bci,bcpdr))
abline(h=c(fit$coef[2],fitbci),lty=4)
legend(400,0.2,c("MC est, 95% CI & CPDR",
 "MLE & classical 95% CI"),lty=c(1,4))
hist(av,prob=T,xlim=c(-4,0),ylim=c(0,1.5),nclass=20,xlab="a")
lines(dena$x,dena$y,lwd=2)
hist(bv,prob=T,xlim=c(0.2,0.8),ylim=c(0,7),nclass=20,xlab="b")
lines(denb$x,denb$y,lwd=2)


# (e) ****************************************************

xxv <- seq(0,15,1); len <- length(xxv)
ppv <- xxv; ppci1 <- xxv; ppci2 <- xxv; ppcpdr1 <- xxv; ppcpdr2 <- xxv

for(i in 1:len){
 xx <- xxv[i]
 ppsim <- 1/(1+exp(-av-bv*xx))
 pp <- mean(ppsim)
 ppci <- pp + c(-1,1)*qnorm(0.975)*sqrt(var(ppsim)/J)
 ppcpdr <- quantile(ppsim,c(0.025,0.975))
 ppv[i] <- pp # MC estimate of E(p|xx) and so indirectly of p at x=xx
 ppci1[i] <- ppci[1]; ppci2[i] <- ppci[2]
 ppcpdr1[i] <- ppcpdr[1]; ppcpdr2[i] <- ppcpdr[2]
}

Xmat <- cbind(1,xxv)
etahat <- Xmat %*% fit$coef # NB: fit was created in (a)
pihat <- 1/(1+exp(-etahat))
etahatvar <- diag( Xmat %*% summary(fit)$cov.unscaled %*% t(Xmat) )
df <- length(yvdata)-length(fit$coef) # 10-2=8
etahatub <- etahat + qt(0.975,df) * sqrt(etahatvar)
etahatlb <- etahat - qt(0.975,df) * sqrt(etahatvar)
pihatub <- 1/(1+exp(-etahatub))
pihatlb <- 1/(1+exp(-etahatlb))

X11(w=8,h=5); par(mfrow=c(1,1))
plot(c(0,15),c(0,1),type="n",xlab="x",ylab="probability p(x)")
points(xvdata,pvdata,pch=16); lines(xxv,ppv)
lines(xxv,ppci1,lwd=2); lines(xxv,ppci2,lwd=2)
lines(xxv,ppcpdr1,lty=2,lwd=2); lines(xxv,ppcpdr2,lty=2,lwd=2)
points(xxv,pihat); lines(xxv,pihatlb,lty=4); lines(xxv,pihatub,lty=4)
legend(8,0.65,c("MC est & 95% CI","95% CPDR","Classical GLM 95% CI"),
 lty=c(1,2,4))
legend(8,0.35,c("Sample proportions","Standard GLM estimates"),pch=c(16,1))
# pphatv <- 1/(1+exp(-ahat-bhat*xxv))
# lines(xxv,pphatv,lty=3) # This alternative estimate is practically
# indistinguishable from ppv and so is not plotted

# (f) ****************************************************

p5v <- 1/(1+exp(-av-bv*5)); set.seed(331); dv <- rbinom(J,20,p5v)
hist(dv,prob=T,breaks=seq(-0.5,20.5,1))
summary(as.factor(dv))
# 3 4  5  6   7   8   9   10   11   12   13   14  15  16  17 18 19
# 1 3 20 75 217 472 845 1188 1562 1733 1546 1123 709 332 131 37  6

dhat <- mean(dv); dci <- dhat + c(-1,1)*qnorm(0.975)*sqrt(var(dv)/J)
dcpdr <- quantile(dv,c(0.025,0.975))
c(dhat,dci,dcpdr) # 11.81 11.76 11.85 7.00 16.00

dv2 <- dv[dv>=7]; dv3 <- dv2[dv2<=16]; length(dv3)/J # 0.9727
dv2 <- dv[dv>=8]; dv3 <- dv2[dv2<=16]; length(dv3)/J # 0.951 OK (>= 95%)
dv2 <- dv[dv>=7]; dv3 <- dv2[dv2<=15]; length(dv3)/J # 0.9395 (too small)

dhat2 <- mean(p5v) # alternative method
qbinom(c(0.025,0.975),20,dhat2) # 7 16


# (g) ****************************************************

Lv <- -av/bv; Lhat <- mean(Lv); Lci <- Lhat + c(-1,1)*qnorm(0.975)*sqrt(var(Lv)/J)
Lcpdr <- quantile(Lv,c(0.025,0.975))
c(Lhat,Lci,Lcpdr) # 4.279 4.273 4.286 3.584 4.916

cf <- coef(fit); Lmle <- -cf[1]/cf[2]; deriv <- c( -1/cf[2] , cf[1]/cf[2]^2 )
Lvar <- t(deriv) %*% summary(fit)$cov.unscaled %*% deriv
Lci2 <- Lmle + c(-1,1)*qt(0.975,8) * sqrt(Lvar)
c(Lmle,Lci2) # 4.287 3.532 5.042


# (h) ****************************************************

xvdata <- c(0.1,1.4,3.6,3.8,5.2,6.1,8.7,9.1,9.1,13.6)
yvdata <- c(1,0,23,12,8,32,10,35,19,8)
nvdata <- c(10,30,40,20,15,46,12,37,23,8)
pvdata <- yvdata/nvdata

MHLRZC <- function(burn,J,a0,b0,xv,yv,nv,sa,sb){
# Performs the Metropolis-Hastings algorithm for a logistic regression model
# modified to have a zero constraint.
# Inputs: burn = number of iterations for burn-in
#         J = required number of Monte Carlo simulations
#         a0 = starting value of alpha
#         b0 = starting value of beta
#         xv = vector of xi values (length n)
#         yv = vector of yi observations
#         nv = vector of ni values
#         sa, sb = standard deviations of the two normal driver fns.
# Outputs: $av = vector of (burn+J+1) values of alpha (incl. starting value)
#          $bv = vector of (burn+J+1) values of beta (incl. starting value)
#          $ara = acceptance rate for alpha (over last J iterations)
#          $arb = acceptance rate for beta.

logfun <- function(a,b,xv,yv,nv){
 phatv <- 1 - exp( -a*xv - b*xv^2 ) # The main change is here
 sum( yv*log(phatv) + (nv-yv)*log(1-phatv) ) }
n <- length(yv); a <- a0; b <- b0
its <- burn + J # total number of iterations
av <- c(a, rep(NA,its)); bv <- c(b, rep(NA,its)) # vectors of simulated a & b values
arav <- c(NA, rep(0,its)); arbv <- c(NA, rep(0,its))
# acceptance rate vectors for a and b
for(j in 1:its){
 a2 <- rnorm(1,a,sa)
 if(a2 > 0){
  logpr <- logfun(a=a2,b=b,xv=xv,yv=yv,nv=nv)-
           logfun(a=a,b=b,xv=xv,yv=yv,nv=nv)
  pr <- exp(logpr); u <- runif(1)
  if(u<pr){ a <- a2; arav[j+1] <- 1 }
 }
 b2 <- rnorm(1,b,sb)
 if(b2 > 0){
  logpr <- logfun(a=a,b=b2,xv=xv,yv=yv,nv=nv)-
           logfun(a=a,b=b,xv=xv,yv=yv,nv=nv)
  pr <- exp(logpr); u <- runif(1)
  if(u<pr){ b <- b2; arbv[j+1] <- 1 }
 }
 av[j+1] <- a; bv[j+1] <- b }

ara <- sum(arav[(burn+2):(its+1)])/J
arb <- sum(arbv[(burn+2):(its+1)])/J # acceptance rates for a & b
list(av=av,bv=bv,ara=ara,arb=arb)
}
burn <- 500; J <- 10000; its <- burn + J; set.seed(111)
res <- MHLRZC(burn=burn,J=J,a0=0.1,b0=0.01,
 xv=xvdata,yv=yvdata,nv=nvdata,sa=0.03,sb=0.005)
c(res$ara,res$arb) # 0.5686 0.5637 OK
par(mfrow=c(2,1)); plot(res$av,type="l"); plot(res$bv,type="l") # OK
options(digits=4)
av <- res$av[-(1:(burn+1))]; ahat <- mean(av)
aci <- ahat + c(-1,1) * qnorm(1-0.05/2)*sqrt(var(av)/J)
acpdr <- quantile(av,c(0.025,0.975))
c(ahat,aci,acpdr) # 0.10921 0.10842 0.11000 0.03622 0.19256

bv <- res$bv[-(1:(burn+1))]; bhat <- mean(bv)
bci <- bhat + c(-1,1) * qnorm(1-0.05/2)*sqrt(var(bv)/J)
bcpdr <- quantile(bv,c(0.025,0.975))
c(bhat,bci,bcpdr) # 0.016683 0.016552 0.016814 0.003641 0.029898

xxv <- seq(0,15,1); len <- length(xxv)
ppv <- xxv; ppci1 <- xxv; ppci2 <- xxv; ppcpdr1 <- xxv; ppcpdr2 <- xxv

for(i in 1:len){
 xx <- xxv[i]
 ppsim <- 1-exp(-av*xx-bv*xx^2)
 pp <- mean(ppsim)
 ppci <- pp + c(-1,1)*qnorm(0.975)*sqrt(var(ppsim)/J)
 ppcpdr <- quantile(ppsim,c(0.025,0.975))
 ppv[i] <- pp # MC estimate of E(p|xx) and so indirectly of p at x=xx
 ppci1[i] <- ppci[1]; ppci2[i] <- ppci[2]
 ppcpdr1[i] <- ppcpdr[1]; ppcpdr2[i] <- ppcpdr[2]
}

X11(w=8,h=5); par(mfrow=c(1,1))
plot(c(0,15),c(0,1),type="n",xlab="x",ylab="probability p(x)")
points(xvdata,pvdata,pch=16); lines(xxv,ppv)
lines(xxv,ppci1,lwd=2); lines(xxv,ppci2,lwd=2)
lines(xxv,ppcpdr1,lty=2,lwd=2); lines(xxv,ppcpdr2,lty=2,lwd=2)
legend(8,0.6,c("MC est & 95% CI","95% CPDR"),lty=c(1,2))
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Exercise 7.4 Autocorrelated Bernoulli data (and practice at 
various statistical techniques) 


Consider the following Bayesian model for a sequence of identically
distributed but possibly dependent and serially autocorrelated Bernoulli
random variables $y_i$:

$$(y_i \mid a, b, y_0, \ldots, y_{i-1}) \sim \text{Bernoulli}(p_i), \quad i = 1, \ldots, n,$$

$$p_i = \frac{1}{1 + \exp\{-(a + b y_{i-1})\}},$$

$$f(a,b) \propto 1, \quad a, b \in \mathbb{R}.$$

Suppose that the data is $y = (y_1, \ldots, y_n) = (1,1,1,1,1,1,1,0,0,0)$.


Use the Metropolis-Hastings algorithm to generate a random sample of 
J = 10,000 values from the joint posterior distribution of a and b. Use this 
sample to estimate the posterior means and 95% CPDRs for a and b. Also
estimate $P(b < 0 \mid y)$.


Solution to Exercise 7.4 


The first thing we need to do is work out the probability that $Y_1 = 1$
conditional on a and b but not conditional on $y_0$ (since $y_0$ is not known).

With an implicit conditioning on a and b, observe by the law of total
probability that

$$P(Y_1 = 1) = P(Y_0 = 0)P(Y_1 = 1 \mid Y_0 = 0) + P(Y_0 = 1)P(Y_1 = 1 \mid Y_0 = 1)$$
$$= \{1 - P(Y_1 = 1)\}P(Y_1 = 1 \mid Y_0 = 0) + P(Y_1 = 1)P(Y_1 = 1 \mid Y_0 = 1),$$

where the second line uses the fact that the $Y_i$ are identically
distributed, so that $P(Y_0 = 1) = P(Y_1 = 1)$.

Solving for $P(Y_1 = 1)$, we get

$$q_1 = P(Y_1 = 1 \mid a, b) = \frac{1 + \exp(a + b)}{2 + \exp(a + b) + \exp(-a)}.$$

Hence, with

$$p_i = P(Y_i = 1 \mid a, b, y_{i-1}) = \frac{1}{1 + \exp(-a - b y_{i-1})}$$

(as already defined), the joint posterior pdf of a and b is

$$f(a,b \mid y) \propto f(a,b) f(y \mid a,b)$$
$$\propto 1 \times f(y_1 \mid a,b) \prod_{i=2}^{n} f(y_i \mid a,b,y_{i-1}) = q_1^{y_1}(1 - q_1)^{1 - y_1} \prod_{i=2}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}.$$

So the log of the posterior density is given by

$$l(a,b) = \log f(a,b \mid y) = c + y_1 \log q_1 + (1 - y_1)\log(1 - q_1) + \sum_{i=2}^{n} \{ y_i \log p_i + (1 - y_i)\log(1 - p_i) \}.$$
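The closed form for $q_1$ can be checked numerically. The following is a minimal sketch, not part of the book's code: the function names `q1_closed` and `q1_solved` and the test values of a and b are illustrative assumptions. It solves the stationarity equation $q = (1-q)s_0 + q s_1$ directly and compares the result with the formula above.

```r
# Closed-form probability q1 = P(Y1 = 1 | a, b), as given in the text.
q1_closed <- function(a, b) {
  (1 + exp(a + b)) / (2 + exp(a + b) + exp(-a))
}

# The same quantity obtained by solving q = (1 - q) * s0 + q * s1,
# where s0 = P(Y1 = 1 | Y0 = 0) and s1 = P(Y1 = 1 | Y0 = 1).
q1_solved <- function(a, b) {
  s0 <- plogis(a)        # 1 / (1 + exp(-a))
  s1 <- plogis(a + b)    # 1 / (1 + exp(-a - b))
  s0 / (1 + s0 - s1)
}

a <- -1; b <- 2          # arbitrary illustrative values (assumption)
c(q1_closed(a, b), q1_solved(a, b))
```

The two numbers printed should agree to machine precision for any real a and b.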


Using normal drivers for both a and b, we implement a Metropolis- 
Hastings algorithm and thereby, following a burn-in of size B = 1,000, 
obtain an approximately random MC sample of size J = 10,000, which we 
will denote by 

$$(a_j, b_j) \sim \text{iid } f(a,b \mid y), \quad j = 1, \ldots, J.$$


From this MC sample we estimate a by -2.337 with 95% CPDR
(-6.3980, 0.8313), and b by 5.411 with 95% CPDR (0.9098, 11.8691).
We also estimate $P(b < 0 \mid y)$ by 0.081.


The traces of a and b over all 11,000 iterations, and histograms of the last 
10,000 values of a and b, respectively, are shown in Figure 7.11, together 
with posterior density estimates. 


Note: In an earlier exercise we considered a posterior predictive
p-value for the null hypothesis that the sequence in the present
exercise consists of values that are iid.

That p-value was estimated as 0.0995 with 95% CI (0.0936, 0.1054).
The estimate 0.081 of $P(b < 0 \mid y)$ in the present exercise may be
interpreted in a similar way to the p-value 0.0995.

In this case the appropriate p-value is one-sided.
If we wish to do a two-sided test, in the present context, b = 0 versus
b ≠ 0, then the p-value may be calculated as twice the minimum of
$P(b < 0 \mid y)$ and $P(b > 0 \mid y)$.

Clearly, if the posterior distribution of b is well above or well below
zero, then the resulting two-sided p-value will appropriately be very
close to zero.
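The one-sided and two-sided quantities described in this note can be computed directly from a vector of posterior draws. The sketch below is illustrative only: the helper name `tail_probs` and the simulated stand-in draws are assumptions, not the book's output. It uses `mean()` so that the denominator is always the number of retained draws.

```r
# Posterior tail probabilities from a vector of MCMC draws.
# mean(draws < 0) divides by length(draws), so the result cannot be
# distorted by a separately stored (and possibly stale) sample size.
tail_probs <- function(draws) {
  p_neg <- mean(draws < 0)
  p_pos <- mean(draws > 0)
  c(p_neg = p_neg, p_pos = p_pos, two_sided = 2 * min(p_neg, p_pos))
}

# Illustrative stand-in for posterior draws of b (NOT the book's sample).
set.seed(1)
b_draws <- rnorm(10000, mean = 2, sd = 1.5)
tail_probs(b_draws)
```

In practice `b_draws` would be replaced by the retained Metropolis-Hastings values of b.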
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Figure 7.11 Traces and histograms for a and b 
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R Code for Exercise 7.4 
yv <-c(1,1,1,1,1, 1,1,0,0,0); n <- length(yv); ybar <- mean(yv); ydot <- sum(yv) 


MHBD <- function(K,a,b,yv,sa,sb){
# Performs a Metropolis-Hastings algorithm for a binary dependence model.
# Inputs: K = total number of iterations
# a,b = starting values of a and b
# yv = vector of 0-or-1 values (y1,...,yn)
# sa, sb = standard deviations of the two normal driver fns.
# Outputs: $av = vector of (K+1) values of a (incl. starting value)
# $bv = vector of (K+1) values of b (incl. starting value)
# $ara, $arb = acceptance rates for a and b.

n <- length(yv); av <- a; bv <- b; cta <- 0; ctb <- 0

logfun <- function(a,b,yv,n){
p1 = (1 + exp(a+b)) / (2 + exp(a+b) + exp(-a)) # p1
p2ton <- 1/(1 + exp(-a-b*yv[-n])) # p2,...,pn
pv <- c(p1,p2ton) # p1,...,pn
sum( yv*log(pv) + (1-yv)*log(1-pv) ) }
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for(j in 1:K){ 
a2 <- rnorm(1,a,sa) # proposed value of a 


logpr <- logfun(a=a2,b=b, yv=yv,n=n)-logfun(a=a,b=b, yv=yv,n=n) 


pr <- exp(logpr); u <- runif(1) 
if(u<pr){ a <- a2; cta<-cta+1} 
if(sb > 0){
b2 <- rnorm(1,b,sb) # proposed value of b 


logpr <- logfun(a=a,b=b2,yv=yv,n=n)-logfun(a=a,b=b,yv=yv,n=n) 


pr <- exp(logpr); u <- runif(1) 
if(u<pr){ b <- b2; ctb <- ctb +1} 
} 
av <- c(av,a); bv <- c(bv,b) 
} 
list(av=av, bv=bv,ara=cta/K,arb=ctb/K) 
} 


options(digits=4); set.seed(143); date() # 


res <- MHBD(K=11000,a=0,b=0,yv=yv,sa=1.5,sb=2.2); date() # Took 2 secs 
c(res$ara,res$arb) # 0.5575 0.5753 (acceptance rates for a and b) OK

X11(w=8,h=6); par(mfrow=c(2,1)); plot(res$av); plot(res$bv) # OK
av <- res$av[1002:11001]; bv <- res$bv[1002:11001]; J=1000

abar <- mean(av); bbar <- mean(bv);
acpdr <- quantile(av,c(0.025,0.975));
bcpdr <- quantile(bv,c(0.025,0.975))

rbind(c(abar,acpdr),c(bbar,bcpdr))
# [1,] -2.337 -6.3980 0.8313
# [2,] 5.411 0.9098 11.8691

pr <- length(bv[bv<0])/J; pr # 0.081
X11(w=8,h=6); par(mfrow=c(2,2));

plot(av,type="l",xlab="j",ylab="a_j",cex=1.2)
plot(bv,type="l",xlab="j",ylab="b_j",cex=1.2)

hist(av,prob=T,xlab="a",ylab="relative frequency",cex=1.2);
abline(v=c(abar,acpdr), lty=1,lwd=3); lines(density(av),lwd=2)

hist(bv,prob=T,xlab="b",ylab="relative frequency", cex=1.2);
abline(v=c(bbar,bcpdr), lty=1,lwd=3); lines(density(bv),lwd=2)
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Exercise 7.5 Inference on the bounds of a uniform distribution 


Consider the following Bayesian model: 
$$(y_1, \ldots, y_n \mid a, b) \sim \text{iid } U(a,b)$$
$$(a \mid b) \sim U(0,b)$$
$$b \sim U(0,1).$$

Generate a random sample of size n = 20 from the model with a = 0.6 and
b = 0.8. Then apply MCMC methods to generate a random sample from
the joint posterior of a and b. Then use this sample to perform Monte Carlo
inference on $m = E(y_i \mid a,b) = (a + b)/2$.


Solution to Exercise 7.5 


Rounding to four decimals, the generated sample values are as shown in 
Table 7.6. 


Table 7.6 Sample values 


i      1       2       3       4       5
y_i    0.7846  0.7572  0.6381  0.7626  0.6105

i      6       7       8       9       10
y_i    0.6990  0.7728  0.7113  0.7314  0.7435

i      11      12      13      14      15
y_i    0.6324  0.7072  0.7493  0.7979  0.6182

i      16      17      18      19      20
y_i    0.7652  0.7883  0.7194  0.6211  0.6054


Note: The range of this data is from 0.6054 to 0.7979. This tells us 
immediately that 0 < a < 0.6054 and 0.7979 <b <1. 
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Now, the joint posterior density of a and b is 
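$$f(a,b \mid y) \propto f(a,b,y) = f(b)\, f(a \mid b)\, f(y \mid a,b)$$
$$= 1 \times \frac{1}{b} \times \prod_{i=1}^{n} \frac{1}{b - a} = \frac{1}{b(b - a)^n}, \quad 0 < a < b < 1, \quad a < \min y_i \le \max y_i < b.$$

So the two conditional posterior distributions are defined by:

$$f(a \mid y, b) \propto \frac{1}{(b - a)^n}, \quad 0 < a < \min(y_i)$$
$$f(b \mid y, a) \propto \frac{1}{b(b - a)^n}, \quad \max(y_i) < b < 1.$$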


Neither of these conditionals defines a well-known distribution. So we 
will apply a ‘pure’ Metropolis-Hastings algorithm (rather than a Gibbs 
sampler). 


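With a' and b' denoting the proposed values of a and b, the acceptance
probabilities at the two steps are:

$$p = \frac{f(a' \mid y, b)}{f(a \mid y, b)} = \frac{1/(b - a')^n}{1/(b - a)^n} = \left( \frac{b - a}{b - a'} \right)^n$$

$$q = \frac{f(b' \mid y, a)}{f(b \mid y, a)} = \frac{1/(b'(b' - a)^n)}{1/(b(b - a)^n)} = \frac{b}{b'} \left( \frac{b - a}{b' - a} \right)^n.$$

The following drivers were chosen:

$$a' \sim N(a, r^2)$$
$$b' \sim N(b, t^2).$$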
Starting at a = 0.1 and b = 0.9, and using the tuning constants r = 0.008 


and t = 0.01, the algorithm was run for 2,500 iterations. The resulting trace 
plots are shown in Figure 7.12. 


We see that stochastic convergence was achieved within 500 iterations. 


The acceptance rates over the last 2,000 iterations were 62% and 58% for
a and b, respectively.
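The acceptance ratios p and q given earlier can be cross-checked against the unnormalised joint posterior $1/(b(b-a)^n)$. The following is a minimal sketch under stated assumptions: the function names, the three made-up data values and the current and proposed parameter values are all hypothetical and chosen only so that every point lies inside the support.

```r
# Log of the unnormalised joint posterior 1 / (b * (b - a)^n) on its support;
# -Inf outside the support, so such proposals would always be rejected.
log_post <- function(a, b, y) {
  n <- length(y)
  if (a <= 0 || a >= min(y) || b <= max(y) || b >= 1) return(-Inf)
  -log(b) - n * log(b - a)
}

# Acceptance ratios written as in the text.
ratio_a <- function(a, a_new, b, n) ((b - a) / (b - a_new))^n
ratio_b <- function(a, b, b_new, n) (b / b_new) * ((b - a) / (b_new - a))^n

y <- c(0.62, 0.70, 0.78)               # small made-up data set (assumption)
a <- 0.50; b <- 0.90                   # current values (assumption)
a_new <- 0.55; b_new <- 0.85           # proposed values (assumption)

# Each pair below should contain two identical numbers.
c(ratio_a(a, a_new, b, length(y)),
  exp(log_post(a_new, b, y) - log_post(a, b, y)))
c(ratio_b(a, b, b_new, length(y)),
  exp(log_post(a, b_new, y) - log_post(a, b, y)))
```

Working with `log_post` differences rather than raw ratios is a common way to avoid overflow when n is large.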
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Figure 7.12 Traces for a and b 
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The algorithm was then run for a further 50,000 iterations, starting at the 
last values in the previous run (a = 0.5979 and b = 0.8123). The acceptance
rates were now 61% and 54%, and this second run took 14 seconds of
computer time.


Then every 50th value was recorded so as to yield a final random sample
of size J = 1,000 from the joint posterior distribution of a and b, i.e.

$$(a_1, b_1), \ldots, (a_J, b_J) \sim \text{iid } f(a,b \mid y).$$

As a check, the sample ACF of each sample of size 1,000 was calculated.


Figure 7.13 shows the ACF estimates for a and b, and these provide no 
evidence for residual autocorrelation in either series. 
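The effect of thinning on autocorrelation can be illustrated with a synthetic chain. This is a sketch only: the AR(1) series below is an assumed stand-in for MCMC output, not the chain from this exercise, and the coefficient 0.9 is arbitrary.

```r
# Stand-in autocorrelated chain (an AR(1) series), used only for illustration.
set.seed(1)
K <- 50000
chain <- as.numeric(arima.sim(model = list(ar = 0.9), n = K))

# Keep every 50th value, as in the text, giving K / 50 retained draws.
thin <- 50
kept <- chain[seq(thin, K, by = thin)]
length(kept)

# Lag-1 to lag-4 sample autocorrelations, before and after thinning.
acf(chain, plot = FALSE)$acf[2:5]
acf(kept, plot = FALSE)$acf[2:5]
```

The unthinned series shows strong autocorrelation at low lags, whereas the thinned series should show values close to zero.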
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Figure 7.13 Sample ACFs for a and b 
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A random sample from the posterior distribution of the mean 
$$m = E(y_i \mid a,b) = (a + b)/2$$

was then formed by calculating

$$m_j = (a_j + b_j)/2.$$

We thereby obtained the random sample

$$m_1, \ldots, m_J \sim \text{iid } f(m \mid y).$$

This Monte Carlo sample was used to estimate $\hat{m} = E(m \mid y)$ by 0.7013,
with 95% CI (0.7008, 0.7019). The estimated 95% CPDR for m was
(0.6837, 0.7173).


Figure 7.14 is a histogram of the 1,000 values of m, overlaid by a density
estimate of $f(m \mid y)$, with the vertical lines showing the point and interval
estimates reported above.
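The three summaries used throughout these exercises (point estimate, 95% CI for the posterior mean, and 95% CPDR) can be wrapped in one helper. The following is a minimal sketch: the function name `mc_summary` and the simulated draws are assumptions for illustration, and the CI formula treats the draws as approximately independent, which is why the thinning check matters.

```r
# Monte Carlo point estimate, CI for the posterior mean, and CPDR.
mc_summary <- function(draws, level = 0.95) {
  J <- length(draws)
  alpha <- 1 - level
  est <- mean(draws)
  ci <- est + c(-1, 1) * qnorm(1 - alpha / 2) * sd(draws) / sqrt(J)
  cpdr <- unname(quantile(draws, c(alpha / 2, 1 - alpha / 2)))
  c(est = est, ci_lo = ci[1], ci_hi = ci[2],
    cpdr_lo = cpdr[1], cpdr_hi = cpdr[2])
}

# Illustrative draws only; in the exercise these would be m_j = (a_j + b_j) / 2.
set.seed(1)
m_draws <- runif(1000, 0.68, 0.72)
mc_summary(m_draws)
```

Note that the CI narrows as J grows, while the CPDR reflects posterior uncertainty and does not.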
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Figure 7.14 Inference on m = (a + b)/2 
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R Code for Problem 7.5 
options(digits=4) 


MH = function(B,J=1000,y,a,b,r,t){
# This function performs a Metropolis-Hastings algorithm for a model involving
# 3 uniforms.
# Inputs: B = burn-in length
# J = desired Monte Carlo size
# y = (y1,...,yn) = data (yi ~ iid U(a,b))
# a = starting value of a (a ~ U(0,b))
# b = starting value of b (b ~ U(0,1))
# r,t = tuning constants for a & b, respectively
# Outputs: $av = (1+B+J) vector of a-values
# $bv = (1+B+J) vector of b-values
# $ar = acceptance rate for a (over last J iterations)
# $br = acceptance rate for b (over last J iterations)
av = a; bv = b; an=0; bn=0; miny=min(y); maxy=max(y); n=length(y);
for(j in 1:(B+J)){
ap = rnorm(1,a,r)
if((0<ap)&&(ap<miny)){
p = ((b-a)/(b-ap))^n; u = runif(1)
if(u<p){ a=ap; if(j>B) an=an+1} }
bp = rnorm(1,b,t)
if((maxy<bp)&&(bp<1)){
q = (b/bp)*((b-a)/(bp-a))^n; v = runif(1)
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if(v<q){ b=bp; if(j>B) bn=bn+1} } 
av=c(av,a); bv=c(bv,b) 
} 


ar = an/J; br=bn/J; list(av=av, bv=bv, ar=ar, br=br) } 


set.seed(337); ydata = runif(20,0.6,0.8); round(ydata,4) 

# [1] 0.7846 0.7572 0.6381 0.7626 0.6105 0.6990 0.7728 0.7113 0.7314 
# [10] 0.7435 0.6324 0.7072 0.7493 0.7979 0.6182 0.7652 0.7883 0.7194 
# [19] 0.6211 0.6054 

summary(ydata)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.605 0.637 0.725 0.711 0.763 0.798

B = 500; J = 2000; set.seed(232)
mh = MH(B=B,J=J,y=ydata, a=0.1,b=0.9,r=0.008,t=0.01)
c(mh$ar,mh$br) # 0.616 0.576

X11(w=8,h=7); par(mfrow=c(2,1))
plot(0:(B+J),mh$av,type="l",main="",xlab="j",ylab="aj")
abline(v=B,lty=3)
plot(0:(B+J),mh$bv,type="l",main="",xlab="j",ylab="bj")
abline(v=B,lty=3)

alast= mh$av[length(mh$av)]; blast= mh$bv[length(mh$bv)]
c(alast,blast) # 0.5979 0.8123

B=0; J = 50000; set.seed(230); date()
mh = MH(B=B,J=J,y=ydata, a=alast,b=blast,r=0.008,t=0.01)
date() # Takes about 14 seconds
c(mh$ar,mh$br) # 0.6141 0.5434

av=mh$av[-1][seq(50,50000,50)]; J = length(av); J # 1000
bv=mh$bv[-1][seq(50,50000,50)];

acf(av)$acf[1:5] # 1 0.04828 0.01193 -0.02745 0.03983 OK
acf(bv)$acf[1:5] # 1 0.038617 0.007026 0.030259 0.011678 OK
mv=0.5*(av+bv)
# acf(mv)$acf[1:5] # 1 -0.001121 -0.020770 0.001872 -0.008731 OK


X11(w=8,h=5); par(mfrow=c(1,1)) 
hist(mv,prob=T,xlab="m",main="", 

xlim=c(0.65,0.75), ylim=c(0,80)) 
lines(density(mv),lwd=2) 
est=mean(mv); ci=est+c(-1,1)*qnorm(0.975)*sd(mv)/sqrt(J) 
cpdr=quantile(mv,c(0.025,0.975)) 
print(c(est,ci,cpdr),digits=4) # 0.7013 0.7008 0.7019 0.6837 0.7173 
abline(v= c(est,ci,cpdr),lwd=2) 
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8.1 Introduction to BUGS 


We have illustrated the usefulness of MCMC methods by applying them 
to a variety of statistical contexts. In each case, specialised R code was 
used to implement the chosen method. Writing such code is typically time 
consuming and requires a great deal of attention to details such as 
choosing suitable tuning constants in the Metropolis-Hastings algorithm. 


A software package which can greatly assist with the application of 
MCMC methods is WinBUGS. This stands for: 
Bayesian Inference Using Gibbs Sampling for Microsoft Windows. 


The BUGS Project was started in 1989 by a team of statisticians in the 
UK (at the Medical Research Council Biostatistics Unit, Cambridge, and 
Imperial College School of Medicine, London) and developed until the 
latest version WinBUGS 1.4.3 was released in 2007. 


WinBUGS 1.4.3 is a stable version of BUGS which is suitable for routine 
use, even today. 


Since 2007, development of BUGS has focused on OpenBUGS, an open 

source version of the package. In what follows we will only refer to 

WinBUGS 1.4.3. This is freely available from the official website: 
http://www.mrc-bsu.cam.ac.uk/software/bugs/ 


Figure 8.1 shows this website (as it appeared on 18 February 2015). 


Figure 8.2 shows the Wikipedia article on WinBUGS (on the same day): 
http://en.wikipedia.org/wiki/WinBUGS 


The preferred reference for citing WinBUGS in scientific papers is: 


Lunn, D.J., Thomas, A., Best, N., and Spiegelhalter, D. (2000). 
WinBUGS - A Bayesian modelling framework: Concepts, 
structure, and extensibility. Statistics and Computing, 10: 
325-337. 
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Figure 8.1 Official website for WinBUGS 


[Screenshot of the BUGS Project page on the MRC Biostatistics Unit
website. The page states that the BUGS (Bayesian inference Using Gibbs
Sampling) project began in 1989 in the MRC Biostatistics Unit, Cambridge,
led initially to the ‘Classic’ BUGS program and then to WinBUGS
(developed jointly with the Imperial College School of Medicine at
St Mary's, London), and that development is now focused on the
OpenBUGS project.]


Figure 8.2 Wikipedia article on WinBUGS 


[Screenshot of the Wikipedia article on WinBUGS. The article describes
WinBUGS as statistical software for Bayesian analysis using Markov chain
Monte Carlo (MCMC) methods, based on the BUGS project started in 1989.
It runs under Microsoft Windows, though it can also be run on Linux using
Wine. The infobox lists the developer as the BUGS Project, an initial
release in 1997, the last version 1.4.3 (August 6, 2007) and a freeware
licence. The article notes that WinBUGS 1.4.3 remains available as a
stable version for routine use but is no longer being developed.]
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8.2 A first tutorial in BUGS 


Consider the following Bayesian model: 
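$$(y_1, \ldots, y_n \mid \mu, \tau) \sim \text{iid Normal}(\mu, \sigma^2) \quad (\tau = 1/\sigma^2)$$
$$\mu \mid \tau \sim \text{Normal}(\mu_0, \sigma_0^2)$$
$$\tau \sim \text{Gamma}(\alpha, \beta) \quad (E\tau = \alpha/\beta)$$

where $\mu_0 = 0$, $\sigma_0^2 = 10{,}000$ and $\alpha = \beta = 0.001$.

Suppose the data is $y = (y_1, \ldots, y_n)$ = (2.4, 1.2, 5.3, 1.1, 3.9, 2.0), and we
wish to find the posterior mean and 95% posterior interval for each of $\mu$
and $\gamma = \mu/\sigma$ (the signal to noise ratio).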


To perform this in WinBUGS 1.4.3, open a new window (select ‘File’ and
then ‘New’ in the BUGS toolbar), and type the following BUGS code:


model 


{ 
for(i in 1:n){ 
y[i] ~ dnorm(mu, tau)
} 
mu ~ dnorm(0,0.0001) 
tau ~ dgamma(0.001, 0.001) 
gam <- mu*sqrt(tau) 


} 
list( n=6, y=c(2.4,1.2,5.3,1.1,3.9,2.0) ) 


list(tau=1) 


Alternatively, copy this text from a Word document into a Notepad file, 
and then copy the text from the Notepad file into the WinBUGS window. 


Note: Do not copy text from Word to WinBUGS directly or you may 
get an error message. 
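A point worth checking when reading the BUGS code above is that `dnorm` in BUGS takes a precision (1/variance) as its second argument, not a standard deviation as in R. The short R sketch below, with arbitrary illustrative values for `mu` and `tau`, converts the precision used in the code back to the prior variance stated in the model and confirms that `mu*sqrt(tau)` is the same as $\mu/\sigma$.

```r
# BUGS parameterises the normal distribution by its precision (1 / variance).
prec_mu <- 0.0001            # second argument of dnorm(0, 0.0001) in the BUGS code
var_mu  <- 1 / prec_mu       # implied prior variance of mu
sd_mu   <- sqrt(var_mu)      # implied prior standard deviation of mu
c(var_mu, sd_mu)

# Signal to noise ratio written two equivalent ways: mu / sigma = mu * sqrt(tau).
mu <- 2.5; tau <- 0.4        # arbitrary illustrative values (assumption)
sigma <- 1 / sqrt(tau)
c(mu / sigma, mu * sqrt(tau))
```

The first pair printed is the prior variance and standard deviation of the mean; the second pair should contain two identical numbers.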


The WinBUGS window should then look as depicted in Figure 8.3.

Figure 8.3 WinBUGS window with code

[Screenshot of an untitled WinBUGS window containing the BUGS model
code, the data list and the initial values list given above.]


Next, select ‘Model’ (in the WinBUGS toolbar) and then ‘Specification’. 


Then highlight the word ‘model’ (in the BUGS code above) and click on 
‘check model’ in the ‘Specification Tool’. 


Then highlight the first word ‘list’, click on ‘load data’ and click on 
‘compile’. 


Then highlight the second word ‘list’, click on ‘load inits’ and click on 
‘gen inits’.

Next, select ‘Inference’ and then ‘Samples’. Then, in the ‘Sample Monitor 
Tool’ which appears, type ‘mu’ in the ‘node’ box, click ‘set’, type ‘gam’ 
in the ‘node’ box and click ‘set’ again. 


Then click ‘Model’ and ‘Update’. 


In the ‘Update Tool’ which appears, change ‘1000’ to ‘1500’ and click 
‘update’. This will implement 1,500 iterations of an MCMC algorithm. 


Next type ‘*’ (an asterisk) in the ‘node’ box, change ‘1’ to ‘501’ in the 
‘beg’ box (meaning beginning) and click ‘stats’ (statistics). 


This should produce something similar to what is shown in Figure 8.4 and
Table 8.1.

Figure 8.4 Tools and node statistics in WinBUGS

[Screenshot of the WinBUGS tools and the node statistics window; the
node statistics are reproduced in Table 8.1.]

Table 8.1 Node statistics in WinBUGS (as in Figure 8.4)

node   mean    sd      MC error  2.5%    median  97.5%  start  sample
gam    1.538   0.6389  0.02113   0.3775  1.521   2.908  501    1000
mu     2.636   0.8181  0.02587   0.9428  2.645   4.313  501    1000


From these results, we see that the posterior mean and 95% posterior
interval for $\mu$ are about 2.64 and (0.94, 4.31), and the same quantities
for $\gamma$ are about 1.54 and (0.38, 2.91).


To obtain more precise inference we could repeat the above procedure
with a larger Monte Carlo sample size (e.g. 10,000 rather than 1,000).

Note: If $\sigma_0 = \infty$ and $\alpha = \beta = 0$, the posterior mean and 95% CPDR for
$\mu$ are exactly
$$\bar{y} = 2.65$$
(i.e. the sample mean) and
$$(\bar{y} \pm t_{0.025}(n-1)\, s/\sqrt{n}) = (0.92, 4.38)$$
(where s is the sample standard deviation).


The posterior mean and CPDR for $\gamma$ do not have such simple formulae.
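The exact values quoted in the note for the posterior mean of the normal mean and its 95% CPDR under the limiting noninformative prior can be reproduced in a few lines of R. This is a small sketch using only base R functions and the six data values given earlier; the variable names are illustrative.

```r
y <- c(2.4, 1.2, 5.3, 1.1, 3.9, 2.0)
n <- length(y)
ybar <- mean(y)                              # posterior mean of mu
s <- sd(y)                                   # sample standard deviation
half <- qt(0.975, df = n - 1) * s / sqrt(n)  # half-width of the 95% CPDR
round(c(ybar, ybar - half, ybar + half), 2)  # 2.65 0.92 4.38
```

These agree closely with the WinBUGS estimates in Table 8.1, as expected given the very diffuse priors used there.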


To see line plots of the simulated values, click on ‘history’ (in the ‘Sample 
Monitor Tool’), and to view smoothed histograms of them, click ‘density’. 
Figure 8.5 illustrates. 


Figure 8.5 Line plots and smoothed histograms in WinBUGS 
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To transfer the simulated values from WinBUGS into R (for further 
analysis) click on ‘coda’. Two boxes will appear, one called ‘CODA index’ 
with the following: 


gam 1 1000 


mu 1001 2000 


The other box, called ‘CODA for chain 1’, should have two columns and 
2,000 rows and look as follows: 


501 1.298
502 1.307
503 1.478
...
1498 0.8303
1499 1.993
1500 2.326
501 1.812
502 1.999
503 2.8
...
1498 1.628
1499 2.161
1500 2.748


Next, copy the contents of ‘CODA for chain 1’ into a Notepad file called 
‘out.txt’ (say). Save that file somewhere, e.g. onto the desktop. 
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Then begin a session in R and proceed as follows: 


out <- read.table(file=file.choose()) # Navigate to and choose ‘out.txt’
dim(out) # 2000 2
gamv <- out[1:1000,2]; muv <- out[1001:2000,2]


par(mfrow=c(2,1)); hist(muv, breaks=20); hist(gamv, breaks=20) 


This should result in the graphs shown in Figure 8.6. 


Figure 8.6 Histograms in R using output from WinBUGS 
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One can then use the MCMC output in many other ways, e.g. to simulate 
from a posterior predictive distribution via the method of composition. 
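As a sketch of the method of composition mentioned here, the R code below draws one future observation for each posterior draw. The vectors `muv` and `gamv` are simulated stand-ins (an assumption made so that the example is self-contained), not actual WinBUGS output; in practice they would be read from the CODA file as shown earlier.

```r
# Stand-in posterior draws (NOT actual WinBUGS output).
set.seed(1)
J <- 1000
muv  <- rnorm(J, 2.6, 0.8)
gamv <- rnorm(J, 1.5, 0.1)     # kept away from zero for this illustration

# Recover sigma from gamma = mu / sigma. With genuine posterior draws mu and
# gamma share the same sign; abs() guards the independent stand-in draws here.
sigv <- abs(muv / gamv)

# Method of composition: one future y per posterior draw of (mu, sigma).
ynew <- rnorm(J, mean = muv, sd = sigv)

# Monte Carlo summary of the posterior predictive distribution.
c(mean = mean(ynew), quantile(ynew, c(0.025, 0.975)))
```

The resulting `ynew` values form a sample from the posterior predictive distribution implied by the supplied draws.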


As an alternative, it is possible to run WinBUGS directly from R after 
installing the appropriate packages. (This will be done in a future exercise). 
But this method is really only for production runs and is not recommended 
during the experimentation stage of an analysis. 


For more information on BUGS, click on ‘Help’ and ‘User manual’ in the 
toolbar. Also see ‘Examples Vol I’ and ‘Examples Vol II’ for several 
dozen worked examples in BUGS. The examples are very user-friendly. 
They contain data, code and everything one needs to reproduce the results 
shown. Figure 8.7 shows various excerpts from these files. 


Figure 8.7 Excerpts from the WinBUGS 1.4.3 User Manual


(several pages) 


WinBUGS User Manual

Version 1.4, January 2003
Upgraded to Version 1.4.3, August 6th, 2007

Beware: MCMC sampling can be dangerous!

Contents

Introduction
Compound Documents
Model Specification
DoodleBUGS: The Doodle Editor
The Model Menu
The Inference Menu
The Info Menu
The Options Menu
Batch-mode: Scripts
Tricks: Advanced Use of the BUGS Language
WinBUGS Graphics
Tips and Troubleshooting
Tutorial
Changing MCMC Defaults (advanced users only)
Distributions
References


Distributions

Contents: Discrete Univariate, Continuous Univariate, Discrete
Multivariate, Continuous Multivariate

Discrete Univariate

Bernoulli: r ~ dbern(p)
$$p^r (1 - p)^{1 - r}; \quad r = 0, 1$$

Binomial: r ~ dbin(p, n)
$$\frac{n!}{r!(n - r)!} p^r (1 - p)^{n - r}; \quad r = 0, \ldots, n$$

Categorical: r ~ dcat(p[])
$$p[r]; \quad r = 1, 2, \ldots, \dim(p); \quad \sum_i p[i] = 1$$

Negative Binomial: x ~ dnegbin(p, r)
$$\frac{(x + r - 1)!}{x!(r - 1)!} p^r (1 - p)^x; \quad x = 0, 1, 2, \ldots$$

Poisson: r ~ dpois(lambda)
$$\frac{e^{-\lambda} \lambda^r}{r!}; \quad r = 0, 1, \ldots$$

Continuous Univariate

Beta: p ~ dbeta(a, b)
$$p^{a - 1}(1 - p)^{b - 1} \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}; \quad 0 < p < 1$$

Chi-squared: x ~ dchisqr(k)
$$\frac{2^{-k/2} x^{k/2 - 1} e^{-x/2}}{\Gamma(k/2)}; \quad x > 0$$


BUGS Examples Volume 1

Rats: Normal hierarchical model
Pump: conjugate gamma-Poisson hierarchical model
Dogs: log linear binary model
Seeds: random effects logistic regression
Surgical: institutional ranking
Salm: extra-Poisson variation in dose-response study
Equiv: bioequivalence in a cross-over trial
Dyes: variance components model
Stacks: robust and ridge regression
Epil: repeated measures on Poisson counts
Blocker: random effects meta-analysis of clinical trials
Oxford: smooth fit to log-odds ratios in case control studies
LSAT: latent variable models for item-response data
Bones: latent trait model for multiple ordered categorical responses
Inhalers: random effects model for ordinal responses from a cross-over trial
Mice: Weibull regression in censored survival analysis
Kidney: Weibull regression with random effects
Leuk: survival analysis using Cox regression
Cox regression with frailties

References:
Sorry - an on-line version of the references is currently unavailable.
Please refer to the existing Examples documentation available from
http://www.mrc-bsu.cam.ac.uk/bugs.


Rats: a normal hierarchical model

This example is taken from section 6 of Gelfand et al (1990), and concerns
30 young rats whose weights were measured weekly for five weeks. Part of
the data is shown below, where $Y_{ij}$ is the weight of the ith rat measured
at age $x_j$.

Weights $Y_{ij}$ of rat i on day $x_j$
$x_j$ = 8, 15, 22, 29, 36

A plot of the 30 growth curves suggests some evidence of downward curvature.

The model is essentially a random effects linear growth curve

$$Y_{ij} \sim \text{Normal}(\alpha_i + \beta_i(x_j - \bar{x}), \tau_c)$$
$$\alpha_i \sim \text{Normal}(\alpha_c, \tau_\alpha)$$
$$\beta_i \sim \text{Normal}(\beta_c, \tau_\beta)$$

where $\bar{x} = 22$, and $\tau$ represents the precision (1/variance) of a normal
distribution. We note the absence of a parameter representing correlation
between $\alpha_i$ and $\beta_i$ unlike in Gelfand et al 1990. However, see the Birats
example in Volume 2 which does explicitly model the covariance between
$\alpha_i$ and $\beta_i$. For now, we standardise the $x_j$'s around their mean to reduce
dependence between $\alpha_i$ and $\beta_i$ in their likelihood: in fact for the full
balanced data, complete independence is achieved. (Note that, in general,
prior independence does not force the posterior distributions to be
independent).

$\alpha_c$, $\tau_\alpha$, $\beta_c$, $\tau_\beta$, $\tau_c$ are given independent "noninformative" priors, with
two alternatives considered for $\tau_\alpha$ and $\tau_\beta$: prior 1 is uniform on the scale
of the standard deviations $\sigma_\alpha = 1/\sqrt{\tau_\alpha}$ and $\sigma_\beta = 1/\sqrt{\tau_\beta}$, and prior 2
is a gamma(0.001, 0.001) on the precisions $\tau_\alpha$ and $\tau_\beta$. Interest particularly
focuses on the intercept at zero time (birth), denoted $\alpha_0 = \alpha_c - \beta_c \bar{x}$.

Graphical model for rats example (using prior 1):
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Note: The last graphic shown is called a Doodle. WinBUGS has a 
facility whereby the user can create such a diagram and have the code 
generated automatically. 


BUGS language for rats example:

model
{
for( i in 1 : N ) {
for( j in 1 : T ) {
Y[i , j] ~ dnorm(mu[i , j],tau.c)
mu[i , j] <- alpha[i] + beta[i] * (x[j] - xbar)
}
alpha[i] ~ dnorm(alpha.c,tau.alpha)
beta[i] ~ dnorm(beta.c,tau.beta)
}
tau.c ~ dgamma(0.001,0.001)
sigma <- 1 / sqrt(tau.c)
alpha.c ~ dnorm(0.0,1.0E-6)
# Choice of prior of random effects variances
# Prior 1: uniform on SD
sigma.alpha ~ dunif(0,100)
sigma.beta ~ dunif(0,100)
tau.alpha <- 1/(sigma.alpha*sigma.alpha)
tau.beta <- 1/(sigma.beta*sigma.beta)
# Prior 2: (not recommended)
# tau.alpha ~ dgamma(0.001,0.001)
# tau.beta ~ dgamma(0.001,0.001)
beta.c ~ dnorm(0.0,1.0E-6)
alpha0 <- alpha.c - xbar * beta.c
}

Data

list(x = c(8.0, 15.0, 22.0, 29.0, 36.0), xbar = 22, N = 30, T = 5,
Y = structure(
.Data = c(151, 199, 246, 283, 320,
145, 199, 249, 293, 354,
147, 214, 263, 312, 328,
155, 200, 237, 272, 297,
135, 188, 230, 280, 323,
159, 210, 252, 298, 331,
141, 189, 231, 275, 305,
159, 201, 248, 297, 338,
177, 236, 285, 350, 376,
134, 182, 220, 260, 296,
160, 208, 261, 313, 352,
143, 188, 220, 273, 314,
154, 200, 244, 289, 325,
171, 221, 270, 326, 358,
163, 216, 242, 281, 312,
160, 207, 248, 288, 324,
142, 187, 234, 280, 316,
156, 203, 243, 283, 317,
157, 212, 259, 307, 336,
152, 203, 246, 286, 321,
154, 205, 253, 298, 334,
139, 190, 225, 267, 302,
146, 191, 229, 272, 302,
157, 211, 250, 285, 323,
132, 185, 237, 286, 331,
160, 207, 257, 303, 345,
169, 216, 261, 295, 333,
157, 205, 248, 289, 316,
137, 180, 219, 258, 291,
153, 200, 244, 286, 324),
.Dim = c(30,5)))

Inits

list(alpha = c(250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250,
250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250),
beta = c(6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6),
alpha.c = 150, beta.c = 10,
tau.c = 1, sigma.alpha = 1, sigma.beta = 1)


Results

A 1000 update burn in followed by a further 10000 updates gave the parameter estimates:

node     mean    sd      MC error  2.5%   median  97.5%  start  sample
alpha0   106.6   3.65    0.04151   99.43  106.5   113.9  1001   10000
beta.c   6.185   0.1102  0.001294  5.967  6.185   6.404  1001   10000
sigma    6.074   0.4673  0.007724  5.247  6.044   7.068  1001   10000

node      mean    MC error  2.5%   median  97.5%  start  sample
Y[26,2]   204.5   0.1159    187.0  204.4   221.7  1001   10000
Y[26,3]   250.0   0.1642    229.7  249.9   270.1  1001   10000
Y[26,4]   295.4   0.2092    270.3  295.3   320.3  1001   10000
Y[26,5]   340.6   0.284     310.2  340.5   370.5  1001   10000
beta.c    6.575   0.003708  6.281  6.573   6.875  1001   10000

[Plots: trace of alpha[30]; density estimate of alpha[29] (sample: 10000)]
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(End of Figure 8.7) 
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Exercise 8.1 Simple linear regression via WinBUGS 


Use WinBUGS to perform a simple linear regression on the data in Table 
8.2 (which is the same as Table 7.1 in Exercise 7.2). 


Table 8.2 Regression data 


x_i (= i)   1      2      3      4      5
y_i         5.879  8.54   14.12  13.14  15.26

x_i (= i)   6      7      8      9      10
y_i         20.43  19.92  18.47  21.63  24.11


Solution to Exercise 8.1 


Using the following WinBUGS code, we obtain the results in Table 8.3: 


model{
for(i in 1:n){
mu[i] <- a + b*x[i]
y[i] ~ dnorm(mu[i],lam)
}
a ~ dnorm(0.0,0.001)
b ~ dnorm(0.0,0.001)
lam ~ dgamma(0.001,0.001)
}


# data 
list(n = 10, x = c(1,2,3,4,5,6,7,8,9,10), y=c(5.879,8.54,14.12, 
13.14,15.26,20.43,19.92,18.47,21.63,24.11)) 


# inits 
list(a=0,b=0,lam=1) 


Table 8.3 Results of regression performed using WinBUGS 


node  mean    sd      MC error  2.5%     median  97.5%   start  sample
a     6.039   1.532   0.01646   2.955    6.051   9.107   1001   10000
b     1.836   0.247   0.00266   1.342    1.834   2.334   1001   10000
lam   0.2625  0.1313  0.001602  0.07259  0.2404  0.5788  1001   10000
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Using the results in Table 8.3, we estimate a by 6.039 with 95% CPDR 
(2.955, 9.107), and we estimate b by 1.836 with 95% CPDR (1.342, 2.334). 


It may be noted that these results are very similar to those obtained via 
classical techniques in an earlier exercise: 6.051 and (2.973, 9.128) for a, 
and 1.836 and (1.340, 2.332) for b. 
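For comparison, the classical least squares fit mentioned above can be obtained with base R's `lm` and `confint`. This is a brief sketch using the data of Table 8.2; the object name `fit` is illustrative.

```r
x <- 1:10
y <- c(5.879, 8.54, 14.12, 13.14, 15.26, 20.43, 19.92, 18.47, 21.63, 24.11)
fit <- lm(y ~ x)
round(coef(fit), 3)        # least squares estimates of a and b
round(confint(fit), 3)     # classical 95% confidence intervals
```

The point estimates should match the classical values 6.051 and 1.836 quoted in the text, and the intervals can be compared with the CPDRs in Table 8.3.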


Figure 8.8 shows trace plots and density estimates produced as part of the 
WinBUGS output. 


Figure 8.8 Graphical output from WinBUGS regression 
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Exercise 8.2 Logistic regression via WinBUGS 
Consider the data in Table 8.4, which is the same as in Table 7.2 of 


Exercise 7.3 (where, for example, in Experiment 3 a total of 40 rats were 
exposed to radiation for 3.6 hours, and 23 of them died within one month). 


Table 8.4 Rat mortality data 


i    n_i  x_i   y_i  y_i/n_i = p̂_i

1    10   0.1   1    1/10 = 0.1
2    30   1.4   0    0/30 = 0
3    40   3.6   23   23/40 = 0.575
4    20   3.8   12   12/20 = 0.6
5    15   5.2   8    8/15 = 0.5333
6    46   6.1   32   32/46 = 0.696
7    12   8.7   10   10/12 = 0.833
8    37   9.1   35   35/37 = 0.946
9    23   9.1   19   19/23 = 0.826
10   8    13.6  8    8/8 = 1

Use WinBUGS to estimate the parameters in the following logistic 
regression model for these data: 


$$Y_i \sim \perp \text{Bin}(n_i, p_i), \quad i = 1, \ldots, n,$$
where:
$$p_i = \frac{1}{1 + \exp(-z_i)} \quad \text{(probability of a ‘success’ for experiment } i)$$
$$z_i = a + b x_i \quad \text{(linear predictor).}$$


In your results, also include inference on LD50, the dose at which 50% of
rats will die (= -a/b), and on d, defined as the number of rats that will die
out of 20 that are exposed to five hours of radiation.
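The two derived quantities can be illustrated in R for fixed parameter values. The sketch below uses arbitrary values of a and b (assumptions, not estimates from the data) to show that the dose -a/b gives a death probability of exactly 0.5, and how a single value of d could be simulated given a and b.

```r
inv_logit <- function(z) 1 / (1 + exp(-z))

# Illustrative parameter values (assumptions, not estimates from the data).
a <- -2; b <- 0.5

ld50 <- -a / b                        # dose at which p = 0.5
inv_logit(a + b * ld50)               # equals 0.5 by construction

p5 <- inv_logit(a + b * 5)            # death probability at five hours
set.seed(1)
d <- rbinom(1, size = 20, prob = p5)  # one simulated value of d given a and b
c(ld50 = ld50, p5 = p5, d = d)
```

In a full Bayesian analysis this calculation is repeated for every posterior draw of (a, b), which is what monitoring the corresponding nodes in WinBUGS achieves.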
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Solution to Exercise 8.2 


Applying the following WinBUGS code, we obtain the results in Table 
8.5: 


model 
{ 
for(i in 1:N){ 
z[i] <- a + b*x[i]
logit(p[i]) <- z[i]
y[i] ~ dbin(p[i],n[i])
}
a ~ dnorm(0.0,0.001)
b ~ dnorm(0.0,0.001)
logit(p5) <- a+5*b
d ~ dbin(p5,20)
LD50 <- -a/b
} 


# data 
list(N=10,n=c(10,30,40,20,15,46,12,37,23,8), 
x=c(0.1,1.4,3.6,3.8,5.2,6.1,8.7,9.1,9.1,13.6), 
y=c(1,0,23,12,8,32,10,35,19,8)) 


# inits 
list(a=0,b=0) 


Table 8.5 Results of logistic regression performed using 
WinBUGS 


node mean sd MC error 2.5% median 97.5% start sample 


LD50 4.273 0.3373 0.00464 3.587 4.285 4.899 1001 10000 
a -2.177 0.3726 0.01041  -2.922 -2.168 -1.478 1001 10000 
b 0.5082 0.06962 0.001964 0.3794 0.5059 0.6501 1001 10000 
d 11.79 2.344 0.02447 7.0 12.0 16.0 1001 10000 
p5 0.5895 0.03946 3.174E-4 0.5125 0.5896 0.6664 1001 10000 




Thus, we estimate a by -2.177 with 95% CPDR (-2.922, -1.478), etc. 


These results are very similar to those obtained via classical techniques in 
Exercise 7.3, namely -2.156 and (-3.000, -1.311) for a, etc. 
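As a rough cross-check on this comparison, a classical fit of the same model can be obtained in base R with glm(). The following is a minimal sketch, assuming the classical analysis is an ordinary maximum likelihood fit; the object names (rats, fit, a_hat and so on) are illustrative and do not come from the original text.

```r
# Rat mortality data from Table 8.4
rats <- data.frame(
  n = c(10, 30, 40, 20, 15, 46, 12, 37, 23, 8),
  x = c(0.1, 1.4, 3.6, 3.8, 5.2, 6.1, 8.7, 9.1, 9.1, 13.6),
  y = c(1, 0, 23, 12, 8, 32, 10, 35, 19, 8)
)

# Maximum likelihood fit of logit(p_i) = a + b*x_i
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = rats)
a_hat <- unname(coef(fit)[1])
b_hat <- unname(coef(fit)[2])

# Plug-in estimates of the derived quantities monitored in WinBUGS
LD50_hat <- -a_hat / b_hat             # dose at which 50% of rats die
p5_hat   <- plogis(a_hat + 5 * b_hat)  # death probability at 5 hours
d_hat    <- 20 * p5_hat                # expected deaths out of 20 rats

print(round(c(a = a_hat, b = b_hat, LD50 = LD50_hat,
              p5 = p5_hat, d = d_hat), 4))
```

Note that these plug-in values ignore parameter uncertainty, whereas the WinBUGS nodes LD50, p5 and d carry full posterior distributions.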


Figure 8.9 shows some traces and density estimates produced as part of 
the WinBUGS output. Here, ‘p5’ represents the probability of a rat dying 
within one month if exposed to five hours of radiation. We chose to 
monitor this node so as to estimate its posterior density. 


Figure 8.9 Graphical output from WinBUGS logistic regression 
Exercise 8.3 Inference on a uniform distribution via WinBUGS 


Consider the following Bayesian model: 
\[ (y_1, \ldots, y_n \mid a, b) \sim \text{iid } U(a, b) \]
\[ (a \mid b) \sim U(0, b) \]
\[ b \sim U(0, 1). \]
Suppose that n = 20 data values from this model with a = 0.6 and 
b = 0.8 are as shown in Table 8.6 (which is the same as Table 7.6 in 
Exercise 7.5). 


Table 8.6 Sample values from a uniform distribution 


i     1       2       3       4       5 
y_i   0.7846  0.7572  0.6381  0.7626  0.6105 


i     6       7       8       9       10 
y_i   0.6990  0.7728  0.7113  0.7314  0.7435 


i     11      12      13      14      15 
y_i   0.6324  0.7072  0.7493  0.7979  0.6182 


i     16      17      18      19      20 
y_i   0.7652  0.7883  0.7194  0.6211  0.6054 


Use WinBUGS to generate a random sample from the joint posterior 
distribution of the parameters a and b. Then use this sample to estimate 
the mean of the uniform distribution, namely 

\[ m = E(y_i \mid a, b) = (a + b)/2. \]


Solution to Exercise 8.3 


Applying the following WinBUGS code we obtain the results in Table 8.7: 


model 
{ 
for(i in 1:n){ y[i] ~ dunif(a,b) } 
b ~ dunif(0,1) 
a ~ dunif(0,b) 
m <- (a+b)/2 
} 


list( n=20, y=c( 0.7846, 0.7572, 0.6381, 0.7626, 0.6105, 
0.6990, 0.7728, 0.7113, 0.7314, 0.7435, 
0.6324, 0.7072, 0.7493, 0.7979, 0.6182, 
0.7652, 0.7883, 0.7194, 0.6211, 0.6054) ) 


list(a=0.1, b=0.9) 


Table 8.7 Results of WinBUGS analysis for a uniform 
distribution 


node mean sd MC error 2.5% median 97.5% start sample 
a 0.594 0.01184 1.996E-4 0.5623 0.5977 0.6051 1001 10000 
b 0.8091 0.01187 2.004E-4 0.7982 0.8054 0.841 1001 10000 
m 0.7016 0.008201 1.388E-4 0.6844 0.7015 0.7187 1001 10000 


Using the results in Table 8.7, we estimate m by 0.7016, with 95% CI 
(0.7013, 0.7019) for m’s posterior mean. 


We also estimate the 95% CPDR for m as (0.6844, 0.7187). 
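The posterior means in Table 8.7 can be checked without MCMC, because the joint posterior here is proportional to (b - a)^(-n) / b on the region 0 < a < min(y), max(y) < b < 1. The following is a minimal sketch using a midpoint grid approximation; the grid size K and all object names are illustrative choices, not part of the original solution.

```r
y <- c(0.7846, 0.7572, 0.6381, 0.7626, 0.6105,
       0.6990, 0.7728, 0.7113, 0.7314, 0.7435,
       0.6324, 0.7072, 0.7493, 0.7979, 0.6182,
       0.7652, 0.7883, 0.7194, 0.6211, 0.6054)
n <- length(y)

# The posterior is positive only for 0 < a < min(y) and max(y) < b < 1.
# Midpoint grids over that region:
K <- 1000
a_grid <- (seq_len(K) - 0.5) / K * min(y)
b_grid <- max(y) + (seq_len(K) - 0.5) / K * (1 - max(y))

# log f(a,b | y) = -n*log(b - a) - log(b) + constant
# (likelihood (b-a)^(-n), prior f(a|b) = 1/b, prior f(b) = 1)
log_post <- outer(a_grid, b_grid, function(a, b) -n * log(b - a) - log(b))
w <- exp(log_post - max(log_post))
w <- w / sum(w)                  # normalised weights on the grid

a_mean <- sum(w * a_grid)        # a varies along the rows of w
b_mean <- sum(t(w) * b_grid)     # b varies along the columns of w
m_mean <- (a_mean + b_mean) / 2  # posterior mean of m = (a + b)/2

print(round(c(a = a_mean, b = b_mean, m = m_mean), 4))
```

Because m is linear in a and b, its posterior mean is simply the average of the two posterior means.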
Note 1: The CI here was obtained in R using the following code: 

0.7016 +c(-1,1)*qnorm(0.975)*0.0001388 
Another CI is (0.7014, 0.7018), obtained using the code: 

0.7016 +c(-1,1)*qnorm(0.975)*0.008201/sqrt( 10000) 
But this second CI is ‘inferior’ to (0.7013, 0.7019) because it ignores 
the autocorrelation in the simulated values. The fact that the second CI 
is shorter corresponds to the fact that its true coverage probability is less 
than the nominal and desired 95%. 
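The effect of autocorrelation on the standard error of a Monte Carlo mean can be illustrated with a simulated chain. The following is a minimal sketch, assuming an AR(1) series as a stand-in for MCMC output (it is not the WinBUGS sample behind Table 8.7) and using a simple batch-means estimator; the function batch_se and the choice of 50 batches are illustrative.

```r
set.seed(1)

# Simulate an autocorrelated chain (AR(1) with coefficient 0.9)
n_iter <- 10000
chain  <- as.numeric(stats::filter(rnorm(n_iter), filter = 0.9,
                                   method = "recursive"))

# Naive standard error: treats the draws as independent
se_naive <- sd(chain) / sqrt(n_iter)

# Batch-means standard error: split the chain into batches and use the
# variability of the batch means, which absorbs the autocorrelation
batch_se <- function(x, n_batch = 50) {
  size  <- floor(length(x) / n_batch)
  means <- sapply(seq_len(n_batch), function(k) {
    mean(x[((k - 1) * size + 1):(k * size)])
  })
  sd(means) / sqrt(n_batch)
}
se_batch <- batch_se(chain)

ci_naive <- mean(chain) + c(-1, 1) * qnorm(0.975) * se_naive
ci_batch <- mean(chain) + c(-1, 1) * qnorm(0.975) * se_batch
print(rbind(naive = ci_naive, batch = ci_batch))
```

For a positively autocorrelated chain the naive interval is too short, which mirrors the comparison between the two CIs in Note 1.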


Note 2: These inferences (above Note 1) are similar to those obtained in 
the solution to Exercise 7.5 using custom-written R code: 0.7013 with 
95% CI (0.7008, 0.7019) and 95% CPDR estimate (0.6837, 0.7173). 


Note 3: The CI in Note 2 is wider than the CI (0.7013, 0.7019) because 
it is based on a smaller Monte Carlo sample size (of 1,000 rather than 
10,000). If we use only iterations 1,001 to 2,000 from the WinBUGS 
output, we get 

m 0.7016 0.008287 3.573E-4 0.6833 0.7016 0.7194 1001 1000 


in place of the corresponding row of Table 8.7. Then, the 95% CI for 
m's posterior mean becomes (0.7009, 0.7023), obtained via 


0.7016 +c(-1,1)*qnorm(0.975)*0.0003573 


This CI has a width of 0.0014, which is greater than 0.0006, the width 
of (0.7013, 0.7019), and closer to 0.0011, the width of the CI in Note 2. 


Figure 8.10 shows some traces and density estimates produced as part of 
the WinBUGS output. 




Figure 8.10 Graphs from WinBUGS analysis for a uniform distribution 


8.3 Tutorial on calling BUGS in R 


The following is a short tutorial on how WinBUGS can be called within 
an R session. Some of the details may need to be changed depending on 
the configuration of files and directories in the computer being used. 


First, assume that R (v3.0.1) is installed in C:/R-3.0.1 
Also assume that WinBUGS (v4.1.3) is installed in C:/WinBUGS14 
Open R and type 


install.packages("R2WinBUGS") 


Note: You must have a connection to the internet for this to work. This 
command is required only once for each installed version of R. 


Next, select a CRAN mirror when prompted. ‘Melbourne’ should work. 


You should then see something like the following: 


package ‘coda’ successfully unpacked and MD5 sums checked 


package ‘R2WinBUGS’ successfully unpacked and MD5 sums checked, etc. 


Then type 


library("R2WinBUGS") 


Note: This loads the necessary functions and must be done at the 
beginning of each R session in which WinBUGS is to be called. 


You should now see something like: 


Loading required package: coda 
Loading required package: lattice 


Loading required package: boot, etc. 
Next, create a file called C:/R-3.0.1/BugsCode1.txt 
which contains the following code for a simple Bayesian model: 


model 
{ 
for(i in 1:n){ y[i] ~ dnorm(mu, tau) } 
mu ~ dnorm(0,0.0001) 
tau ~ dgamma(0.001, 0.001) 
gam <- mu*sqrt(tau) 
} 


Next create a working directory, say C:/R-3.0.1/BugsOut/ 
and proceed in R as follows: 


y <- c(2.4,1.2,5.3,1.1,3.9,2.0) 
n <- length(y) 
data <- list("n","y") 
inits <- function(){ list(mu=0, tau=1.0) } 
parameters <- c("mu", "gam") 
sim <- bugs(data, inits, parameters, 
   model.file= "C:/R-3.0.1/BugsCode1.txt", 
   n.chains = 1, n.iter = 1500, n.burnin=500, DIC = FALSE, 
   bugs.directory = "C:/WinBUGS14/", 
   working.directory = "C:/R-3.0.1/BugsOut/") 


This sets things up, starts WinBUGS, runs the BUGS code, closes 
WinBUGS, and creates a number of files in the working directory, similar 
to the ones shown in Figure 8.11. 
Figure 8.11 Files created by running WinBUGS in R 

[Screenshot of the BugsOut folder, listing the files coda1.txt, codaIndex.txt, data.txt, inits1.txt, log.odc, log.txt and script.txt.] 


These files contain information which can then be accessed within R, for 
example as follows: 


print(sim,digits=4) 
# Inference for Bugs model at "C:/R-3.0.1/BugsCode1.txt", fit using WinBUGS, 
# 1 chains, each with 1500 iterations (first 500 discarded) 
# n.sims = 1000 iterations saved 
#       mean     sd   2.5%    25%   50%    75%  97.5% 
# mu  2.6358 0.8185 0.9424 2.1760 2.645 3.1175 4.2984 
# gam 1.5380 0.6392 0.3774 1.0935 1.521 1.9360 2.9061 

par(mfrow=c(2,1)) 
hist(sim$sims.list$mu, breaks=20) 
hist(sim$sims.list$gam, breaks=20) 


After typing these commands, you should see two histograms similar to 
the ones shown in Figure 8.12. For more information on the bugs() 
function, simply type 


help(bugs) 
Figure 8.12 Histograms obtained in R after calling WinBUGS 


Note: If your WinBUGS code has an error, the procedure will crash, 
with little to tell you what went wrong. In that case, first iron out any 
‘bugs’ directly in WinBUGS, and only then run your WinBUGS code 
in R, as above. 
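For readers without access to WinBUGS, the output of the tutorial model can be cross-checked with a short Gibbs sampler written directly in R. The following is a minimal sketch, assuming the standard conditional distributions for a normal model with a normal prior on mu (precision 0.0001) and a gamma prior on tau (shape and rate 0.001); the run lengths and object names are illustrative choices and do not come from the original text.

```r
set.seed(1)
y <- c(2.4, 1.2, 5.3, 1.1, 3.9, 2.0)
n <- length(y)

n_iter   <- 6000
n_burnin <- 1000
mu  <- numeric(n_iter)
tau <- numeric(n_iter)
mu_cur  <- 0   # initial values matching the inits function in the tutorial
tau_cur <- 1

for (j in seq_len(n_iter)) {
  # mu | tau, y is normal with precision n*tau + 0.0001
  prec   <- n * tau_cur + 0.0001
  mu_cur <- rnorm(1, mean = tau_cur * sum(y) / prec, sd = 1 / sqrt(prec))
  # tau | mu, y is gamma with shape 0.001 + n/2 and
  # rate 0.001 + sum((y - mu)^2)/2
  tau_cur <- rgamma(1, shape = 0.001 + n / 2,
                    rate = 0.001 + sum((y - mu_cur)^2) / 2)
  mu[j]  <- mu_cur
  tau[j] <- tau_cur
}

keep <- (n_burnin + 1):n_iter
gam  <- mu[keep] * sqrt(tau[keep])   # same derived node as in the BUGS code
print(round(c(mu = mean(mu[keep]), gam = mean(gam)), 3))
```

The estimates should be broadly comparable to those printed by print(sim, digits=4), up to Monte Carlo error.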
Exercise 8.4 ARIMA modeling and forecasting with WinBUGS in R 


Consider the well-known Total International Airline Passengers (TIAP) 
time series, as shown in Table 8.8. This series describes quarterly totals 
of international passengers for the period January 1949 to December 1960. 
(Here, Qtr1 refers to the period January—March, etc.) 


Table 8.8 The TIAP time series 


Year   Qtr1   Qtr2   Qtr3   Qtr4 
1949    362    385    432    341 
1950    382    409    498    387 
1951    473    513    582    474 
1952    544    582    681    557 
1953    628    707    773    592 
1954    627    725    854    661 
1955    742    854   1023    789 
1956    878   1005   1173    883 
1957    972   1125   1336    988 
1958   1020   1146   1400   1006 
1959   1108   1288   1570   1174 
1960   1227   1468   1736   1283 


Using classical methods, fit a suitable ARIMA model to this time series. 


Then forecast the time series forward for one up to twelve quarters. 


Then repeat your analysis and forecasts using WinBUGS called from R. 


Also create a single graph which compares both sets of forecasts. 
Solution to Exercise 8.4 


Figure 8.13 shows plots of the original time series $x_t$, its logarithm 
(showing stabilised variability), the difference of the logarithm (showing 
a removal of the trend), and $y_t$, the fourth seasonal difference of the first 
difference of the logarithm (showing that seasonality has been removed). 
The last two (bottom) plots are the sample ACF and sample PACF for $y_t$. 


Figure 8.13 Plots for the TIAP time series 
The last two plots in Figure 8.13 suggest SAR(1) or SMA(1) processes. 
Both fits pass standard diagnostic checks, the second being marginally 
better. Figure 8.14 shows some diagnostic plots for the SMA(1) fit (see 
the R Code below for further details). 


Figure 8.14 Diagnostics for the SMA(1) fit to the TIAP 
time series 
The chosen SMA(1) model for the TIAP time series $x_t$ may be expressed 
by writing 
\[ y_t = \nabla_4 \nabla \log x_t, \]
where 
\[ y_t = w_t + \Theta_1 w_{t-4}, \quad w_t \sim \text{iid } N(0, \sigma^2). \]


The parameter estimates for this model are: 
\[ \hat{\Theta}_1 = -0.4927 \ (\mathrm{SE} = 0.1201) \]
\[ \hat{\sigma}^2 = 0.0013. \]
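The differenced series can be computed either by nested calls to diff() or via the explicit formula y_t = z_t - z_{t-1} - z_{t-4} + z_{t-5} with z_t = log x_t, which is the form used in the WinBUGS code for this exercise. The following short check confirms that the two agree; the names y_diff, y_explicit and t_idx are illustrative.

```r
x <- c(362, 385, 432, 341, 382, 409, 498, 387, 473, 513, 582, 474,
       544, 582, 681, 557, 628, 707, 773, 592, 627, 725, 854, 661,
       742, 854, 1023, 789, 878, 1005, 1173, 883, 972, 1125, 1336, 988,
       1020, 1146, 1400, 1006, 1108, 1288, 1570, 1174, 1227, 1468, 1736, 1283)
z <- log(x)
n <- length(x)

# y_t via nested differencing: seasonal difference at lag 4, then first difference
y_diff <- diff(diff(z, lag = 4))

# y_t via the explicit formula, defined for t = 6, ..., n
t_idx <- 6:n
y_explicit <- z[t_idx] - z[t_idx - 1] - z[t_idx - 4] + z[t_idx - 5]

stopifnot(isTRUE(all.equal(y_diff, y_explicit)))
length(y_diff)   # number of usable values of y_t, namely n - 5
```

Rearranging the same formula gives z_t = z_{t-1} + z_{t-4} - z_{t-5} + y_t, which is how forecasts of y_t are converted back to forecasts of log x_t.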


Figure 8.15 shows the time series $x_t$ plus predictions 12 quarters ahead 
based on the above fitted model. The dashed lines show the 95% 
prediction interval at each of the 12 future time points. (See the R code 
below for details regarding all calculations.) 
Figure 8.15 Classical forecasts of the TIAP time series 


We now fit the same model to the time series but using MCMC via 
WinBUGS called from R. Some graphical output from the WinBUGS run 
is shown in Figure 8.16. (See the code below for details.) 


Figure 8.17 shows the Bayesian analogue of the classical forecasts 
displayed in Figure 8.15. 


To compare the classical and Bayesian analyses, we combine the two sets 
of forecasts into a single plot, as shown in Figure 8.18. Figure 8.19 is a 
detail in Figure 8.18. 


Figure 8.16 Output from an analysis of the TIAP series using WinBUGS 


Figure 8.17 Bayesian forecasts of the TIAP time series 


Figure 8.18 Comparison of forecasts for the TIAP time series 


Figure 8.19 Detail in Figure 8.18 


We see from Figures 8.18 and 8.19 that the two approaches to inference 
have yielded very similar results, at least as regards prediction. 


The Bayesian approach has produced 95% prediction intervals which are 
slightly wider than those obtained via the classical approach. 


It may be argued that such wider intervals are more appropriate, since the 
classical approach makes forecasts without taking into account any 
uncertainty in the parameter estimates. 


By contrast, the Bayesian approach to forecasting does take into account 
that uncertainty. 


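To conclude, we report that the fitted model for the TIAP time series $x_t$ 
is given by 
\[ y_t = \nabla_4 \nabla \log x_t, \]
with 
\[ y_t = w_t + \Theta_1 w_{t-4}, \quad w_t \sim \text{iid } N(0, \sigma^2), \]
where, via classical analysis: 
\[ \hat{\Theta}_1 = -0.4927 \ (\mathrm{SE} = 0.1201), \quad \hat{\sigma}^2 = 0.0013, \]
and where, via Bayesian analysis: 
\[ \hat{\Theta}_1 = -0.4661 \ (\mathrm{SE} = 0.1266), \quad \hat{\sigma}^2 = 0.0015. \]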
R and WinBUGS Code for Exercise 8.4 


# Classical analysis in R 

x <- c(362, 385, 432, 341, 382, 409, 498, 387, 473, 513, 582, 474, 
   544, 582, 681, 557, 628, 707, 773, 592, 627, 725, 854, 661, 742, 
   854, 1023, 789, 878, 1005, 1173, 883, 972, 1125, 1336, 988, 1020, 
   1146, 1400, 1006, 1108, 1288, 1570, 1174, 1227, 1468, 1736, 1283 ) 
n <- length(x); n # 48 

X11(w=8,h=9); par(mfrow=c(3,2)) 
plot(x,type="l"); abline(v=seq(0,48,4),h=seq(0,2000,100), lty=3) 
plot(log(x),type="l"); abline(v=seq(0,48,4), lty=3) 
plot(diff(log(x)),type="l"); abline(v=seq(0,48,4), lty=3) 
plot(diff(diff(log(x),lag=4)),type="l"); abline(v=seq(0,48,4), lty=3) 
y <- diff(diff(log(x),lag=4)) 
acf(y, lag=24) 
pacf(y,lag=24) 

fit1 <- arima( log(x),order=c(0,1,0), seasonal=list(order=c(1,1,0), period=4) ) 
tsdiag(fit1); fit1 
#          sar1 
#       -0.4990 
# s.e.   0.1417 
# sigma^2 estimated as 0.001310: log lik. = 81.12, aic = -158.24 

fit2 <- arima( log(x),order=c(0,1,0), seasonal=list(order=c(0,1,1), period=4) ) 
tsdiag(fit2); fit2 
#          sma1 
#       -0.4927 
# s.e.   0.1201 
# sigma^2 estimated as 0.001306: log lik. = 81.2, aic = -158.4 

# There’s not much to distinguish the two fits. 
# The second one is marginally better. 
# Let's now display the diagnostics for that fit (again). 
fit <- fit2; tsdiag(fit) 

# We see that the residuals from the fit are well-behaved, 
# and their sample ACF is consistent with that of white noise. 
# Let's also look at some other diagnostics. These turn out to be OK too. 

X11(w=8,h=5); par(mfrow=c(2,2)) 
acf(fit$resid, lag=24) 
pacf(fit$resid, lag=24) 
qqnorm(fit$resid) 
hist(fit$resid, nclass=12) 

# Check whether to include a mean term 
mean(y) # 0.0008141388 
fit3 <- arima( y, order=c(0,0,0), seasonal=list(order=c(0,0,1), period=4), 
   include.mean=T ); fit3 
#          sma1  intercept 
#       -0.4937    -0.0003  <--------- not significant 
# s.e.   0.1204     0.0031 
# So there’s no need for an intercept term in the model. 

# Let’s now make some predictions. 
logxpredict <- predict(fit, n.ahead=12) 
xF <- exp(logxpredict$pred) 
xL <- exp(logxpredict$pred - qnorm(0.975)* logxpredict$se) 
xU <- exp(logxpredict$pred + qnorm(0.975)* logxpredict$se) 

cbind(xF, xL, xU) 

# xF xL xU 

# 49 1365.822 1272.412 1466.090 
# 50 1602.240 1449.497 1771.079 
# 51 1916.210 1694.939 2166.367 
# 52 1418.253 1230.895 1634.130 
# 53 1509.806 1264.357 1802.904 
# 54 1771.148 1439.872 2178.641 
# 55 2118.215 1677.977 2673.956 
# 56 1567.764 1213.320 2025.751 
# 57 1668.969 1244.652 2237.940 
# 58 1957.861 1412.873 2713.066 
# 59 2341.516 1640.034 3343.038 
# 60 1733.037 1180.875 2543.381 
X11(h=5); par(mfrow=c(1,1)); plot(c(0,60),c(0,3800), type="n") 
lines(x, lwd=2); points(x, lwd=2); 
points((n+1):(n+12), xF, pch=16, cex=1.5); 
lines(n:(n+12), c(x[n],xF), lty=1,lwd=2) 
# points((n+1):(n+12), xL, pch=16); 
lines((n+1):(n+12), xL, lty=2, lwd=2) 
# points((n+1):(n+12), xU, pch=16); 
lines((n+1):(n+12), xU, lty=2, lwd=2) 
abline(v=seq(0,100,4),h=seq(0,4000,100), lty=3) # OK.... 


# Bayesian reanalysis in R and WinBUGS 


# Assume that R (v3.0.1) is installed in C:/R-3.0.1 
# and WinBUGS (v4.1.3) is installed in C:/WinBUGS14 


install.packages("R2WinBUGS") # Not necessary if done previously 
library("R2WinBUGS") # Necessary every time R is started 


# Make sure the following directory exists: C:/R-3.0.1/BugsOut/ 
# Create a file called C:/R-3.0.1/BugsCode2.txt with the following: 

model 
{ 
for(t in 1:n){ z[t] <- log(x[t]) } 
for(t in 1:5){ y[t] <- 0; w[t] ~ dnorm(0,tau) } 
for(t in 6:n){ y[t] <- z[t] - z[t-1] - z[t-4] + z[t-5] } 
for(t in 6:N){ # N = n+12 = 60 
m[t] <- Phi1*w[t-4] 
y[t] ~ dnorm(m[t],tau) 
w[t] <- y[t] - m[t] 
} 
tau ~ dgamma(0.001,0.001) 
Phi1dum ~ dbeta(1,1); Phi1 <- 2*Phi1dum-1 
for(k in 1:12){ 
z[n+k] <- z[n+k-1] + z[n+k-4] - z[n+k-5] + y[n+k] 
x[n+k] <- exp(z[n+k]) 
} 
sig2 <- 1/tau 
} 

# NB: We can't specify Phi1 ~ dunif(-1,1). This causes an error. 
# Update in March 2014: Phi1 ~ dunif(-1,1) works in WinBUGS 1.4.3. 

x <- c(362, 385, 432, 341, 382, 409, 498, 387, 473, 513, 582, 474, 
   544, 582, 681, 557, 628, 707, 773, 592, 627, 725, 854, 661, 742, 
   854, 1023, 789, 878, 1005, 1173, 883, 972, 1125, 1336, 988, 1020, 
   1146, 1400, 1006, 1108, 1288, 1570, 1174, 1227, 1468, 1736, 1283, 
   NA,NA,NA,NA, NA,NA,NA,NA, NA,NA,NA,NA) 

n <- 48; N <- 60; data <- list("n","N","x") 
inits <- function(){ list(tau=1, Phi1dum=0.5) } 
parameters <- c("sig2", "Phi1", "x") 


sim <- bugs(data, inits, parameters, n.thin=1, 
model.file= "C:/R-3.0.1/BugsCode2.txt", 
n.chains = 1, n.iter = 6000, n.burnin=1000, DIC = FALSE, 
bugs.directory = "C:/WinBUGS14/", 
working.directory = "C:/R-3.0.1/BugsOut/") 


# This starts WinBUGS, runs the BUGS code for 6000 iterations, closes 
# WinBUGS, and creates a number of files in the working directory. These 
# files contain information which can also be accessed within R, as follows. 


print(sim,digits=4) 


# Inference for Bugs model at "C:/R-3.0.1/BugsCode2.txt", fit using WinBUGS, 
# 1 chains, each with 6000 iterations (first 1000 discarded) 
# n.sims = 5000 iterations saved 


# mean sd 2.5% 25% 50% 75% 97.5% 

# sig2 0.0015 0.0003 0.0009 0.0012 0.0014 0.0016 0.0022 
# Phi1 -0.4661 0.1266 -0.6910 -0.5548 -0.4740 -0.3865 -0.1944 
# x[49] 1367.1820 52.6189 1265.0000 1332.0000 1365.0000 1402.0000 
# 1472.0000 

# x[50] 1605.9746 86.2790 1443.0000 1547.0000 1603.0000 1662.0000 
# 1781.0000 

# x[51] 1918.2346 124.7788 1681.9750 1835.0000 1914.0000 2000.0000 
# 2172.0250 

# x[52] 1422.9222 107.4501 1220.9750 1350.0000 1420.0000 1491.0000 
# 1641.0000 

# x[53] 1517.8472 146.0119 1247.9750 1418.7499 1514.0000 1610.0000 
# 1822.0000 

# x[54] 1783.4306 201.9834 1415.0000 1645.0000 1777.0000 1908.2500 
# 2217.0000 




# x[55] 2133.7016 273.1291 1646.9750 1946.7500 2119.0000 2306.0000 
# 2724.0000 
# x[56] 1584.1955 223.5842 1187.9750 1431.0000 1576.0000 1720.2499 
# 2066.0000 
# x[57] 1693.4548 276.4929 1211.9750 1499.7499 1674.0000 1857.0000 
# 2309.0750 
# x[58] 1992.9153 364.3849 1370.9750 1742.7499 1968.0000 2204.0000 
# 2837.0999 
# x[59] 2388.4000 476.7169 1589.8999 2058.7500 2345.0000 2668.0000 
# 3453.0250 
# x[60] 1775.0647 381.9082 1137.0000 1511.0000 1735.0000 1992.0000 
# 2628.1249 


help(bugs) #To get info on how to do the following... 


Phi1v <- sim$sims.list$Phi1; sig2v <- sim$sims.list$sig2 
xm <- sim$sims.list$x 


par(mfrow=c(2,2)) 
hist(Phi1v, breaks=20); hist(sig2v, breaks=20) 
hist(xm[,1], breaks=20); hist(xm[,2], breaks=20) 


# Let’s now make the forecasts of the series using the BUGS output. 
xF2 <- xF; xL2 <- xL; xU2 <- xU; for(t in 1:12){ 
xF2[t] <- mean(xm[,t]) 
xL2[t] <- quantile(xm[,t], 0.025) 
xU2[t] <- quantile(xm[,t], 0.975) } # Calc. estimates 

par(mfrow=c(1,1)); plot(c(0,60),c(0,3800), type="n") 
lines(x, lwd=2); points(x, lwd=2) 
points((n+1):(n+12), xF2, pch=16, cex=1.5); 
lines(n:(n+12), c(x[n],xF2), lty=1,lwd=2) 
lines((n+1):(n+12), xL2, lty=2, lwd=2) 
lines((n+1):(n+12), xU2, lty=2, lwd=2) 
abline(v=seq(0,100,4),h=seq(0,4000,100), lty=3) # OK..... 
# Next we graph both sets of forecasts together in a single plot, 
# and then produce a close-up in that single plot, as follows: 

X11(h=5); par(mfrow=c(1,1)); 
plot(c(0,60),c(0,3800), type="n", xlab="t", ylab="xt") 
lines(x, lwd=2); points(x, lwd=2) 
points((n+1):(n+12), xF, pch=16, cex=1.5, col="red"); 
lines(n:(n+12), c(x[n],xF), lty=1,lwd=2, col="red") 
lines((n+1):(n+12), xL, lty=1, lwd=2, col="red") 
lines((n+1):(n+12), xU, lty=1, lwd=2, col="red") 
abline(v=seq(0,100,4),h=seq(0,4000,100), lty=3) 
points((n+1):(n+12), xF2, pch=16, cex=1.5, col="blue"); 
lines(n:(n+12), c(x[n],xF2), lty=2,lwd=2, col="blue") 
lines((n+1):(n+12), xL2, lty=2, lwd=2, col="blue") 
lines((n+1):(n+12), xU2, lty=2, lwd=2, col="blue") 
legend(0,3000,c("Classical","Bayesian"), lty=c(1,2), 
   lwd=c(2,2), col=c("red", "blue"), bg="white" ) 

par(mfrow=c(1,1)) 
plot(c(40,60),c(1000,3500), type="n", xlab="t", ylab="xt") 
lines(x, lwd=2); points(x, lwd=2) 
points((n+1):(n+12), xF, pch=16, cex=1.5, col="red"); 
lines(n:(n+12), c(x[n],xF), lty=1,lwd=2, col="red") 
lines((n+1):(n+12), xL, lty=1, lwd=2, col="red") 
lines((n+1):(n+12), xU, lty=1, lwd=2, col="red") 
abline(v=seq(0,100,4),h=seq(0,4000,100), lty=3) 
points((n+1):(n+12), xF2, pch=16, cex=1.5, col="blue"); 
lines(n:(n+12), c(x[n],xF2), lty=2,lwd=2, col="blue") 
lines((n+1):(n+12), xL2, lty=2, lwd=2, col="blue") 
lines((n+1):(n+12), xU2, lty=2, lwd=2, col="blue") 
legend(40,3000,c("Classical","Bayesian"), lty=c(1,2), 
   lwd=c(2,2), col=c("red", "blue"), bg="white" ) 
9.1 Introduction 


In this chapter we will focus on the topic of Bayesian methods for finite 
population inference in the sample survey context. We have previously 
touched on this topic when considering posterior predictive inference of 
‘future’ values in the context of the normal-normal-gamma model. The 
topic will now be treated more generally and systematically. 


There are many and various ways in which Bayesian finite population 
inference can be categorised, for example: 


• situations with and without prior information being available 
• sampling with and without replacement 
• Monte Carlo based methods versus deterministic (or ‘exact’) methods 
• situations with and without auxiliary information being available 
• scenarios where a superpopulation variance is known and where it is unknown 
• sampling with equal probabilities versus unequal probabilities 
• sampling mechanisms that are ignorable versus nonignorable (i.e. biased) 
• cases where the order of sampling is known versus where that order is unknown 
• cases with full response versus where some sampled units fail to respond. 


Each of these categories can in turn be broken down further. For example, 
Monte Carlo based techniques may or may not require Markov chain 
Monte Carlo methods for generating the sample required for inference. 
We see there is potentially a vast subject ground to cover. 


We will begin with a description of some basic general concepts, notation 
and terminology in relation to finite population modelling in the Bayesian 
framework, with a focus on simple random sampling without replacement 
(SRSWOR). We then illustrate these ideas by way of a series of exercises 
which also feature some other concepts such as simple random sampling 
with replacement (SRSWR), nonignorable sampling schemes, and 
covariate data. Some of these ideas will be taken up again in later chapters. 


We defer discussion of Bayesian finite population models involving 
normal (i.e. Gaussian) data to the next chapter (Chapter 10), where such 
models are the focus and treated in detail. In Chapter 11 we will discuss 
data transformations, inference on non-standard quantities of interest, and 
frequentist properties of Bayesian estimators in a finite population 
context, including the notions of model bias and design bias. Chapter 12 
will focus on the issues of biased sampling and nonignorable nonresponse. 


The exposition in Chapters 9 to 12 is largely theoretical but does include 
mention of several real world applications, including on-site sampling of 
recreation parks, oil discovery, and correcting for self-selection bias in 
volunteer surveys. Further discussion of the role that Bayesian methods 
and prior information play in survey sampling and finite population 
inference can be found in Rao (2011). This paper also lists many other 
papers and books on this and related topics, for example Ericson (1969) 
and Särndal, Swensson and Wretman (1992). 


9.2 Finite population notation and terminology 


Consider a finite population of N units labelled $i = 1, \ldots, N$, and let $y_i$ be 
the value of the ith unit for some observable variable of interest. 


Define $y = (y_1, \ldots, y_N)$ as the population vector. 


Suppose that n units are selected from the finite population without 
replacement. 


We refer to n as the sample size and to $m = N - n$ as the nonsample size. 


Let $s = (s_1, \ldots, s_n)$ be the vector of the ordered labels of the sampled units. 


Also let $r = (r_1, \ldots, r_m)$ be the vector of the ordered labels of the 
nonsampled units, i.e. those remaining. 


Define $y_s = (y_{s_1}, \ldots, y_{s_n})$ to be the sample vector, and likewise define 
$y_r = (y_{r_1}, \ldots, y_{r_m})$ to be the nonsample vector. 
Note 1: With the above definitions, it is always true that 
\[ s_1 < \cdots < s_n \]
and 
\[ r_1 < \cdots < r_m, \]
irrespective of the order in which the population units may actually be 
sampled. Also, 
\[ \{s_1, \ldots, s_n, r_1, \ldots, r_m\} = \{1, \ldots, N\}. \]


Note 2: For mathematical convenience, the population, sample and 
nonsample vectors may later sometimes be defined as the column 
vectors 
\[ y = (y_1, \ldots, y_N)' = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \quad y_s = (y_{s_1}, \ldots, y_{s_n})' \quad \text{and} \quad y_r = (y_{r_1}, \ldots, y_{r_m})', \]
respectively. 


Also, the population vector may sometimes be written using upper case 
letters, as $Y = (Y_1, \ldots, Y_N)$ or $Y = (Y_1, \ldots, Y_N)'$. For the remainder of this 
chapter, these alternative notations will not be used. 


Example: Suppose that we select n = 3 units from a finite population of 
size N = 7 and obtain units 4, 5 and 2 (in that order, or any other order). 


Then the nonsample size is $m = N - n = 4$ and: 
\[ y = (y_1, \ldots, y_7), \]
\[ s = (s_1, s_2, s_3) = (2, 4, 5), \quad y_s = (y_2, y_4, y_5), \]
\[ r = (r_1, r_2, r_3, r_4) = (1, 3, 6, 7), \quad y_r = (y_1, y_3, y_6, y_7). \]
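The bookkeeping in this example can be reproduced in a few lines of R. The following is a minimal sketch; the population values in y are hypothetical numbers chosen only to show the indexing, and the variable names simply mirror the notation of this section.

```r
N <- 7
L <- c(4, 5, 2)               # labels in the order the units were drawn
s <- sort(L)                  # ordered labels of the sampled units
r <- setdiff(seq_len(N), s)   # ordered labels of the nonsampled units
n <- length(s)                # sample size
m <- N - n                    # nonsample size

# Hypothetical population values, used only to illustrate the indexing
y   <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.0, 9.5)
y_s <- y[s]                   # sample vector (y_2, y_4, y_5)
y_r <- y[r]                   # nonsample vector (y_1, y_3, y_6, y_7)

print(list(s = s, r = r, y_s = y_s, y_r = y_r))
```

Any other ordering of L, such as c(2, 5, 4), leads to the same s and r, which is the point of defining them as ordered label vectors.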


9.3 Bayesian finite population models 


Consider a finite population vector y which may be thought of as having 
been generated from some probability distribution which depends on a 
parameter $\theta$ (possibly a vector). 


Also suppose that a sample of size n is drawn from the finite population 
without replacement according to some probability distribution for s. 


This scenario may be expressed in terms of a Bayesian finite population 
model with the following form: 


$f(s \mid y, \theta)$  (the probability of obtaining sample s for given 
values of y and $\theta$) 
$f(y \mid \theta)$  (the model density of the finite population vector 
given $\theta$) 
$f(\theta)$  (the prior density of the parameter). 


Suppose that we have data of the form $D = (s, y_s)$ and are interested in a 
quantity $Q = g(y, \theta)$, for some function g. Then the task is to determine 
the distribution of Q given D. 


This distribution will be based on the joint distribution of the two 
unobserved quantities $\theta$ and $y_r$, given the two observed quantities, 
namely: 
s (which tells us which units are sampled); and 
$y_s$ (the vector of the values of the sampled units). 


Thus, inference on the quantity of interest $Q = g(y, \theta)$ is based on the 
density $f(Q \mid D)$, which in turn is based on the density 
\[ f(\theta, y_r \mid s, y_s) \propto f(\theta, y_r, y_s, s) = f(\theta) f(y_s, y_r \mid \theta) f(s \mid y_s, y_r, \theta). \quad (9.1) \]


Note 1: The values of s and r here are fixed at their observed values 
defined by the data. Thus, given $D = (s, y_s)$, we may always express 
$Q = g(y, \theta)$ as $h((y_s, y_r), \theta)$ for some function h (which will in many 
cases be the same function as g), and there should be no ambiguity in 
the meaning of quantities such as $f(y_s, y_r \mid \theta)$ in (9.1). 


Note 2: We have specified the sampling mechanism in terms of the 
quantity s which tells us which units are sampled but not the order in 
which they are sampled. In some cases it may be appropriate to replace 
$f(s \mid y, \theta)$ in the model by $f(L \mid y, \theta)$, where 
\[ L = (L_1, \ldots, L_n) \]
is the vector of the labels of the selected units in the order that they are 
sampled. L provides more information than s, which is a function of L. 
Note 3: We have assumed that sampling is without replacement. If 
sampling is with replacement, it may be appropriate to replace $f(s \mid y, \theta)$ 
or $f(L \mid y, \theta)$ in the model by $f(I \mid y, \theta)$, where 
\[ I = (I_1, \ldots, I_N), \]
and where $I_i$ is the number of times that population unit i is sampled. 


In this case it may be necessary to modify the notation to account for the 
number of distinct units sampled, previously the fixed constant n, due to 
the possibility of multiple selections under sampling with replacement. 


Example 1: Suppose that we sample units 4, 5 and 2, in that order. Then 
$L = (L_1, L_2, L_3) = (4, 5, 2)$ and $s = (2, 4, 5)$. Note that s is a function of L. 


Example 2: Suppose we sample units 3, 5 and 3, in that order. Then 
$L = (L_1, L_2, L_3) = (3, 5, 3)$ and $I = (0, 0, 2, 0, 1)$. In this case, we write 
$s = (s_1, \ldots, s_d) = (s_1, s_2) = (3, 5)$ as the ordered vector of distinct labels for 
the units sampled. Here, d is the number of distinct units sampled (a 
random variable with realised value 2), in contrast to n, the total number 
of selections (a fixed constant equal to 3). Note that d is a function of s, 
which is a function of I, which in turn is a function of L. 
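The chain of functions in Example 2 (from L to I to s to d) can be traced in R as follows. This is a minimal sketch; the population size N = 5 is inferred from the length of the vector I in the example, and the variable names mirror the notation above.

```r
N <- 5                        # population size implied by the length of I
L <- c(3, 5, 3)               # labels in the order of selection
I <- tabulate(L, nbins = N)   # I[i] = number of times unit i was selected
s <- which(I > 0)             # ordered vector of distinct sampled labels
d <- length(s)                # number of distinct units sampled
n <- length(L)                # total number of selections

print(list(I = I, s = s, d = d, n = n))
```

Each step discards information: I forgets the order of selection, and s forgets how often each unit was selected.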


9.4 Two types of sampling mechanism 


There are basically two types of sampling mechanism in the context of the 
above model, data and quantity of interest. These two types correspond to 
two distinct cases, as follows: 


(i) where $f(Q \mid D)$ remains exactly the same if the sampling density 
$f(s \mid y_s, y_r, \theta)$ is omitted from the calculation at equation (9.1); 
in this case we say that the sampling mechanism is ignorable (or 
unbiased) 


(ii) where $f(Q \mid D)$ changes in some way if the sampling density 
$f(s \mid y_s, y_r, \theta)$ is omitted from the calculation at equation (9.1); 
we then say the sampling mechanism is nonignorable (or biased). 




Perhaps the simplest example of an ignorable sampling mechanism is 
simple random sampling without replacement (SRSWOR), for which 
\[ f(s \mid y, \theta) = \binom{N}{n}^{-1}, \quad s \in S(s), \]
where 
\[ S(s) = \{(1, \ldots, n), (1, 2, \ldots, n-1, n+1), \ldots, (N-n+1, \ldots, N)\} \]
is the sample space for s (the set of all possible combinations of n integers 
taken from N). 


In this case, $f(s \mid y, \theta)$ does not depend on y or $\theta$ at all and so may also 
be written simply as $f(s)$. This then guarantees that 
\[ f(s \mid y_s, y_r, \theta) = f(s) = \binom{N}{n}^{-1} \]
at the single observed value of s, whatever that value may be. 


Therefore, the joint density of the two unknowns is 
\[ f(\theta, y_r \mid s, y_s) \propto f(\theta, y_r, y_s, s) = f(\theta) f(y_s, y_r \mid \theta) f(s \mid y_s, y_r, \theta) \propto f(\theta) f(y_s, y_r \mid \theta) \times 1, \]
which is the same as (9.1) but with $f(s \mid y_s, y_r, \theta)$ omitted. 


This result tells us that $f(Q \mid D)$ will be the same when the sampling 
mechanism density $f(s \mid y_s, y_r, \theta)$ is ‘ignored’ in the model, so to speak. 
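The SRSWOR sample space and its constant sampling density can be enumerated directly for a small population. The following is a minimal sketch with illustrative values N = 7 and n = 3; the object names are not from the original text.

```r
N <- 7
n <- 3

samples   <- combn(N, n)       # each column is one ordered label vector s
n_samples <- ncol(samples)     # number of possible samples, choose(N, n)
f_s <- rep(1 / n_samples, n_samples)   # f(s) is the same for every s

samples[, 1]                   # first sample: (1, ..., n)
samples[, n_samples]           # last sample: (N-n+1, ..., N)

# One SRSWOR draw, reported as an ordered label vector
set.seed(1)
s <- sort(sample(N, n))
```

Since f(s) takes the same value whatever y and the parameter are, it acts as a constant factor in (9.1) and drops out of the posterior.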


9.5 Two types of inference 


There are basically two types of inference in the context of the above 
model, data and quantity of interest: 


(a) where Q does not depend on y, in which case inference is on 
$Q = g(\theta)$ (a function of only $\theta$) and may be called analytic 
inference or infinite population inference or superpopulation 
inference 


(b) where Q does not depend on $\theta$, in which case inference is on 
$Q = g(y)$ (a function of only y) and may be called descriptive 
inference or finite population inference or predictive inference. 


9.6 Analytic inference 


In the case of analytic inference, this is based solely on the posterior
density of the model parameter $\theta$, namely

$$f(\theta \mid D) = f(\theta \mid s, y_s) \propto f(\theta, s, y_s) = \int f(\theta, s, y_s, y_r)\, dy_r = \int f(\theta)\, f(y_s, y_r \mid \theta)\, f(s \mid y_s, y_r, \theta)\, dy_r.$$

Now suppose further that the sampling mechanism is ignorable. In that
case,

$$f(\theta \mid D) \propto \int f(\theta)\, f(y_s, y_r \mid \theta) \times 1\, dy_r$$
since $f(s \mid y_s, y_r, \theta)$ may be ignored

$$= f(\theta)\, f(y_s \mid \theta) \int f(y_r \mid \theta, y_s)\, dy_r$$
since $f(y_s, y_r \mid \theta) = f(y_s \mid \theta)\, f(y_r \mid \theta, y_s)$

$$= f(\theta)\, f(y_s \mid \theta)$$
since $\int f(y_r \mid \theta, y_s)\, dy_r = 1$ for all $\theta$.

Thus the posterior density of $\theta$ is obtained in exactly the same way as in
previous chapters.

Note: As stressed earlier, it is to be understood that s in $f(y_s \mid \theta)$ here
is fixed at its observed value. With this understanding, we will
sometimes abbreviate $f(\theta \mid D) = f(\theta \mid s, y_s)$ as $f(\theta \mid y_s)$.

Example: If s = (2,4,5), then $y_s$ means specifically $(y_2, y_4, y_5)$. Thus,
in this context, $y_s$ does not refer to the vector $(y_{s_1}, y_{s_2}, y_{s_3})$ with the
subscripts $s_1, s_2, s_3$ as random variables.
9.7 Descriptive inference

In the case of descriptive inference, this is based solely on the predictive
density of the nonsample vector $y_r$, namely

$$f(y_r \mid D) = f(y_r \mid s, y_s) \propto f(s, y_s, y_r) = \int f(\theta, s, y_s, y_r)\, d\theta = \int f(\theta)\, f(y_s, y_r \mid \theta)\, f(s \mid y_s, y_r, \theta)\, d\theta.$$

Now suppose further that the sampling mechanism is ignorable. In that
special case,

$$f(y_r \mid D) \propto \int f(\theta)\, f(y_s, y_r \mid \theta) \times 1\, d\theta$$
since $f(s \mid y_s, y_r, \theta)$ may be ignored

$$= \int f(y_r \mid y_s, \theta)\, f(\theta)\, f(y_s \mid \theta)\, d\theta$$

$$\propto \int f(y_r \mid y_s, \theta)\, f(\theta \mid y_s)\, d\theta$$
since $f(\theta \mid y_s) \propto f(\theta)\, f(y_s \mid \theta)$.

So the predictive density of $y_r$ is obtained in exactly the same way as in
previous chapters.

Note: As before, it is to be understood that s and r in $f(y_r \mid y_s, \theta)$ and
$f(\theta \mid y_s)$ are fixed at their observed values. With this understanding,
we will sometimes write

$f(y_r \mid D) = f(y_r \mid s, y_s)$ as $f(y_r \mid y_s)$.

More generally, we will sometimes write

$f(\theta, y_r \mid D) = f(\theta, y_r \mid s, y_s)$ as $f(\theta, y_r \mid y_s)$,
and
$f(Q \mid D) = f(Q \mid s, y_s)$ as $f(Q \mid y_s)$.

Example: If s = (2,4,5) and N = 7 then $y_s$ means $(y_2, y_4, y_5)$ and $y_r$
means $(y_1, y_3, y_6, y_7)$.
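
The labelling convention in this example can be sketched in R as follows. The population values in y are hypothetical numbers invented for illustration; only s = (2,4,5) and N = 7 come from the example.

```r
N <- 7
s <- c(2, 4, 5)                       # labels of the sampled units (fixed at their observed values)
r <- setdiff(1:N, s)                  # labels of the nonsampled units: 1 3 6 7
y <- c(10, 20, 30, 40, 50, 60, 70)    # hypothetical finite population values
ys <- y[s]                            # sample vector (y2, y4, y5)
yr <- y[r]                            # nonsample vector (y1, y3, y6, y7)
```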
Exercise 9.1 A Bernoulli finite population model with ignorable
sampling

A finite population of size N = 4 consists of values that are independently
and identically distributed (iid) Bernoulli with parameter $\theta$, where $\theta$ is a
priori equally likely to be 1/4 or 1/2 (with no other possibilities).

We sample n = 2 units from the finite population according to SRSWOR.
Units 2 and 4 are sampled, and both have the value 1.

(a) Find the posterior distribution of $\theta$.

(b) Find the predictive distribution of the finite population total, namely
$y_T = y_1 + \cdots + y_N$.


Solution to Exercise 9.1 


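(a) The Bayesian model here may be written:

$$f(s \mid y, \theta) = \binom{N}{n}^{-1} = \binom{4}{2}^{-1} = \frac{1}{6}, \quad s = (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)$$

$$f(y \mid \theta) = \prod_{i=1}^{N} \theta^{y_i} (1-\theta)^{1-y_i}$$
(the model density of the finite population values)

$$f(\theta) = 1/2, \quad \theta = 1/4, 1/2$$
(the prior density of the parameter).

The observed sample data is
$D = (s, y_s) = ((s_1, s_2), (y_{s_1}, y_{s_2})) = ((2,4), (y_2, y_4)) = ((2,4), (1,1))$,
and the nonsample vector is $y_r = (y_{r_1}, y_{r_2}) = (y_1, y_3) \in \{0,1\}^2$.

The sampling mechanism is ignorable, and so

$$f(\theta \mid D) \propto f(\theta)\, f(y_s \mid \theta) \propto 1 \times \prod_{i \in s} \theta^{y_i} (1-\theta)^{1-y_i} = \theta^2$$
since n = 2 and $y_i = 1$ for all $i \in s$

$$= \begin{cases} (1/4)^2 = 1/16, & \theta = 1/4 \\ (1/2)^2 = 4/16, & \theta = 1/2. \end{cases}$$

It follows that

$$f(\theta \mid D) = \begin{cases} 1/5, & \theta = 1/4 \\ 4/5, & \theta = 1/2. \end{cases}$$

(b) Next, observe that

$$f(y_r \mid D, \theta) = \begin{cases} (1-\theta)^2, & y_r = (0,0) \\ (1-\theta)\theta, & y_r = (0,1) \\ \theta(1-\theta), & y_r = (1,0) \\ \theta^2, & y_r = (1,1). \end{cases}$$

This implies that

$$f(y_r \mid D) = \sum_{\theta} f(y_r \mid D, \theta)\, f(\theta \mid D) = \begin{cases} \left(\frac{3}{4}\right)^2 \frac{1}{5} + \left(\frac{1}{2}\right)^2 \frac{4}{5} = \frac{25}{80}, & y_r = (0,0) \\ \frac{3}{4} \cdot \frac{1}{4} \cdot \frac{1}{5} + \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{4}{5} = \frac{19}{80}, & y_r = (0,1) \\ \frac{1}{4} \cdot \frac{3}{4} \cdot \frac{1}{5} + \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{4}{5} = \frac{19}{80}, & y_r = (1,0) \\ \left(\frac{1}{4}\right)^2 \frac{1}{5} + \left(\frac{1}{2}\right)^2 \frac{4}{5} = \frac{17}{80}, & y_r = (1,1). \end{cases}$$

The nonsample total is $y_{rT} = y_1 + y_3$, with three possible values:

0 + 0 = 0
0 + 1 = 1 + 0 = 1
1 + 1 = 2.

Therefore

$$f(y_{rT} \mid D) = \begin{cases} 25/80, & y_{rT} = 0 \\ 38/80, & y_{rT} = 1 \\ 17/80, & y_{rT} = 2. \end{cases}$$

The finite population total is $y_T = y_{sT} + y_{rT}$, where $y_{sT} = y_2 + y_4 = 1 + 1 = 2$
is the sample total. It follows that the required predictive density of
the finite population total is

$$f(y_T \mid D) = \begin{cases} 25/80, & y_T = 2 + 0 = 2 \\ 38/80, & y_T = 2 + 1 = 3 \\ 17/80, & y_T = 2 + 2 = 4. \end{cases}$$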
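
The calculations in this solution can be checked numerically. The following is a minimal sketch in R (variable names are illustrative assumptions); it uses the fact that, given $\theta$, the total of the two nonsampled Bernoulli values is Binomial(2, $\theta$).

```r
theta <- c(1/4, 1/2)
prior <- c(1/2, 1/2)

# (a) posterior: prior times the likelihood theta^2 of the two sampled 1s
post <- prior * theta^2
post <- post / sum(post)                 # 1/5 and 4/5

# (b) predictive of the nonsample total, averaging Binomial(2, theta) over the posterior
yrT <- 0:2
pred <- sapply(yrT, function(t) sum(dbinom(t, 2, theta) * post))
data.frame(yT = 2 + yrT, prob = pred)    # 25/80, 38/80, 17/80 at yT = 2, 3, 4
```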
Exercise 9.2 A Bernoulli finite population model with
nonignorable sampling

A finite population of size N = 4 consists of values that are conditionally
iid Bernoulli with parameter $\theta$, where $\theta$ is a priori equally likely to be
1/4 or 1/2 (with no other possibilities).

We sample n = 2 units from the finite population without replacement in
such a way that the probability of selecting a sample is proportional to
the sum of the values in that sample.

Units 2 and 4 are sampled, and both have the value 1.

(a) Find the posterior distribution of $\theta$.

(b) Find the predictive distribution of the finite population total, namely
$y_T = y_1 + \cdots + y_N$.

(c) Find the conditional posterior distribution of $\theta$ given the nonsample
vector, and then employ this distribution to check your answer to (a) using
results in (b).

(d) Find the following probabilities of selection into the sample:
(i) $P(i \in s \mid y, \theta)$ (ii) $P(i \in s \mid y)$
(iii) $P(i \in s \mid \theta)$ (iv) $P(i \in s)$.


Solution to Exercise 9.2 


(a) The Bayesian model here may be written:

$$f(s \mid y, \theta) \propto y_{sT}, \quad s = (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)$$

$$f(y \mid \theta) = \prod_{i=1}^{N} \theta^{y_i} (1-\theta)^{1-y_i}$$
(the model density of the finite population values)

$$f(\theta) = 1/2, \quad \theta = 1/4, 1/2$$
(the prior density of the parameter).

The observed sample data is
$D = (s, y_s) = ((s_1, s_2), (y_{s_1}, y_{s_2})) = ((2,4), (y_2, y_4)) = ((2,4), (1,1))$,
and the nonsample vector is
$y_r = (y_{r_1}, y_{r_2}) = (y_1, y_3) \in \{0,1\}^2$.

In this case the sampling mechanism is nonignorable and the first thing
we should do is determine the exact form of the sampling density of
$s = (s_1, s_2)$. Now,

$$f(s \mid y, \theta) = c\, y_{sT} = c\, (y_{s_1} + y_{s_2})$$

for some constant c such that

$$1 = \sum_{s} f(s \mid y, \theta) = c\{(y_1 + y_2) + (y_1 + y_3) + (y_1 + y_4) + (y_2 + y_3) + (y_2 + y_4) + (y_3 + y_4)\} = c\{3(y_1 + y_2 + y_3 + y_4)\} = 3 c\, y_T.$$

We see that $c = 1/(3 y_T)$, and so

$$f(s \mid y, \theta) = \frac{y_{s_1} + y_{s_2}}{3 y_T}, \quad s = (s_1, s_2) = (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).$$


Note 1: This formula shows explicitly how the sampling mechanism
depends on the values in the finite population vector y. It also shows
that, conditional on y, the sampling mechanism does not depend on the
superpopulation parameter $\theta$.

Note 2: This formula is only true when the finite population total $y_T$ is
positive, i.e. when at least one of $y_1, \ldots, y_4$ is nonzero. In the case where
all population values are zero, we have that $y_{sT} = y_{s_1} + y_{s_2} = 0$ for all
possible samples $s = (s_1, s_2)$, and consequently $f(s \mid y, \theta) \propto 0$, which
must be understood to mean that no sample actually gets drawn. The
fact that a sample has been observed implies $f(s \mid y, \theta) > 0$ for at least
one value of s, which implies that at least one population value is
positive, which in turn implies that $y_T > 0$. This would be true even if
all the sample values were zero; but as it happens, at least one of them
is positive (in fact both are), which in itself implies that $y_T > 0$.
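
The sampling density derived above can be tabulated for any given population vector. The sketch below is illustrative only: the function name fs_given_y and the example vector y = (0,1,0,1) are assumptions, not taken from the text.

```r
# Sampling density f(s | y) = (y_s1 + y_s2) / (3 * yT) for N = 4, n = 2
fs_given_y <- function(y) {
  S <- t(combn(4, 2))                  # the six possible samples, one per row
  yT <- sum(y)
  if (yT == 0) return(NULL)            # no sample is drawn when all values are zero (Note 2)
  f <- (y[S[, 1]] + y[S[, 2]]) / (3 * yT)
  data.frame(s1 = S[, 1], s2 = S[, 2], f = f)
}

out <- fs_given_y(c(0, 1, 0, 1))       # hypothetical population vector
out                                    # sample (2,4) is the most likely; (1,3) has probability 0
sum(out$f)                             # the density sums to 1
```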


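We may now work out the joint density of all quantities in the model:

$$f(\theta, y_r, y_s, s) = f(\theta)\, f(y_s \mid \theta)\, f(y_r \mid \theta)\, f(s \mid y_s, y_r, \theta)$$

$$= \frac{1}{2} \left( \prod_{i \in s} \theta^{y_i} (1-\theta)^{1-y_i} \right) \left( \prod_{i \in r} \theta^{y_i} (1-\theta)^{1-y_i} \right) \frac{y_{sT}}{3 y_T}$$

$$= \frac{1}{2} \times \theta^2 \times \theta^{y_{rT}} (1-\theta)^{2 - y_{rT}} \times \frac{1 + 1}{3(y_1 + 1 + y_3 + 1)}$$

$$\propto \frac{\theta^{2 + y_{rT}} (1-\theta)^{2 - y_{rT}}}{2 + y_1 + y_3}.$$

So the posterior density of $\theta$ is

$$f(\theta \mid D) = f(\theta \mid s, y_s) \propto f(\theta, s, y_s) = \sum_{y_r} f(\theta, s, y_s, y_r)$$

$$\propto \theta^2 (1-\theta)^2 \sum_{y_1 = 0}^{1} \sum_{y_3 = 0}^{1} \left( \frac{\theta}{1-\theta} \right)^{y_1 + y_3} \frac{1}{2 + y_1 + y_3}$$

$$= \theta^2 (1-\theta)^2 \left\{ \frac{1}{2} + \frac{2}{3} \left( \frac{\theta}{1-\theta} \right) + \frac{1}{4} \left( \frac{\theta}{1-\theta} \right)^2 \right\}$$

$$= \frac{\theta^2}{12} \left\{ 6(1-\theta)^2 + 8\theta(1-\theta) + 3\theta^2 \right\}$$

$$= \begin{cases} \frac{1}{12(256)} \left[ 6(9) + 8(3) + 3(1) \right], & \theta = 1/4 \\ \frac{1}{12(256)} \left[ 6(16) + 8(16) + 3(16) \right], & \theta = 1/2 \end{cases}$$

$$\propto \begin{cases} 6(9) + 8(3) + 3(1) = 81, & \theta = 1/4 \\ 6(16) + 8(16) + 3(16) = 272, & \theta = 1/2. \end{cases}$$

Now 81 + 272 = 353, and so

$$f(\theta \mid D) = \begin{cases} 81/353 = 0.22946, & \theta = 1/4 \\ 272/353 = 0.77054, & \theta = 1/2. \end{cases}$$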

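(b) The predictive density of the nonsample vector
$y_r = (y_{r_1}, y_{r_2}) = (y_1, y_3)$
is

$$f(y_r \mid D) = f(y_r \mid s, y_s) \propto f(y_r, s, y_s) = \sum_{\theta} f(\theta, s, y_s, y_r)$$

$$\propto \frac{1}{2 + y_1 + y_3} \sum_{\theta} \theta^{2 + y_1 + y_3} (1-\theta)^{2 - y_1 - y_3}$$

$$= \frac{1}{2 + y_1 + y_3} \left\{ \left( \frac{1}{4} \right)^{2 + y_1 + y_3} \left( \frac{3}{4} \right)^{2 - y_1 - y_3} + \left( \frac{2}{4} \right)^{2 + y_1 + y_3} \left( \frac{2}{4} \right)^{2 - y_1 - y_3} \right\}$$

$$= \frac{3^{2 - y_1 - y_3} + 16}{(2 + y_1 + y_3)\, 256} \propto \frac{16 + 3^{2 - y_1 - y_3}}{2 + y_1 + y_3}$$

$$= \begin{cases} \frac{16 + 3^{2-0-0}}{2 + 0 + 0} = \frac{25}{2} = \frac{150}{12}, & (y_1, y_3) = (0,0) \\ \frac{16 + 3^{2-0-1}}{2 + 0 + 1} = \frac{19}{3} = \frac{76}{12}, & (y_1, y_3) = (0,1) \\ \frac{16 + 3^{2-1-0}}{2 + 1 + 0} = \frac{19}{3} = \frac{76}{12}, & (y_1, y_3) = (1,0) \\ \frac{16 + 3^{2-1-1}}{2 + 1 + 1} = \frac{17}{4} = \frac{51}{12}, & (y_1, y_3) = (1,1). \end{cases}$$

Now, 150 + 76 + 76 + 51 = 353, and so

$$f(y_r \mid D) = \begin{cases} 150/353 = 0.42493, & y_r = (0,0) \\ 76/353 = 0.21530, & y_r = (0,1) \\ 76/353 = 0.21530, & y_r = (1,0) \\ 51/353 = 0.14448, & y_r = (1,1). \end{cases}$$

So the predictive density of the nonsample total,
$y_{rT} = y_{r_1} + y_{r_2} = y_1 + y_3$,
is

$$f(y_{rT} \mid D) = \begin{cases} 150/353 = 0.42493, & y_{rT} = 0 \\ 152/353 = 0.43059, & y_{rT} = 1 \\ 51/353 = 0.14448, & y_{rT} = 2. \end{cases}$$

So the predictive density of the finite population total,
$y_T = y_{sT} + y_{rT} = (1 + 1) + y_{rT}$,
is

$$f(y_T \mid D) = \begin{cases} 150/353 = 0.42493, & y_T = 2 \\ 152/353 = 0.43059, & y_T = 3 \\ 51/353 = 0.14448, & y_T = 4. \end{cases}$$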


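(c) The conditional posterior density of $\theta$ given $y_r$ is

$$f(\theta \mid y_s, y_r, s) \propto f(\theta, y_s, y_r, s) \propto \theta^{2 + y_{rT}} (1-\theta)^{2 - y_{rT}}.$$

We now need to consider all the possible values of $y_r$, one by one.

For $y_r = (0,0)$:

$$f(\theta \mid y_s, y_r, s) \propto \theta^2 (1-\theta)^2 \ \Rightarrow\ f(\theta \mid y_s, y_r, s) = \begin{cases} 9/25, & \theta = 1/4 \\ 16/25, & \theta = 1/2. \end{cases}$$

For $y_r = (0,1)$:

$$f(\theta \mid y_s, y_r, s) \propto \theta^3 (1-\theta) \ \Rightarrow\ f(\theta \mid y_s, y_r, s) = \begin{cases} 3/19, & \theta = 1/4 \\ 16/19, & \theta = 1/2. \end{cases}$$

For $y_r = (1,0)$:

$$f(\theta \mid y_s, y_r, s) \propto \theta^3 (1-\theta) \ \Rightarrow\ f(\theta \mid y_s, y_r, s) = \begin{cases} 3/19, & \theta = 1/4 \\ 16/19, & \theta = 1/2. \end{cases}$$

For $y_r = (1,1)$:

$$f(\theta \mid y_s, y_r, s) \propto \theta^4 (1-\theta)^0 \ \Rightarrow\ f(\theta \mid y_s, y_r, s) = \begin{cases} 1/17, & \theta = 1/4 \\ 16/17, & \theta = 1/2. \end{cases}$$

Now,

$$f(\theta \mid y_s, s) = \sum_{y_r} f(\theta, y_r \mid y_s, s) = \sum_{y_r} f(\theta \mid y_s, y_r, s)\, f(y_r \mid y_s, s).$$

So, using results in (b), we have that:

$$f(\theta = 1/4 \mid y_s, s) = \frac{9}{25} \times \frac{150}{353} + \frac{3}{19} \times \frac{76}{353} + \frac{3}{19} \times \frac{76}{353} + \frac{1}{17} \times \frac{51}{353} = \frac{81}{353} = 0.22946$$

$$f(\theta = 1/2 \mid y_s, s) = \frac{16}{25} \times \frac{150}{353} + \frac{16}{19} \times \frac{76}{353} + \frac{16}{19} \times \frac{76}{353} + \frac{16}{17} \times \frac{51}{353} = \frac{272}{353} = 0.77054.$$

These results are all in agreement with those obtained in (a) using a
different approach.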
(d) (i) The probability of selecting unit i into the sample given y and $\theta$ has
the same form for all i, in particular i = 1, and so may be written

$$P(1 \in s \mid y, \theta) = \sum_{s:\, 1 \in s} f(s \mid y, \theta) = \frac{(y_1 + y_2) + (y_1 + y_3) + (y_1 + y_4)}{3 y_T} = \frac{y_T + 2 y_1}{3 y_T} = \frac{1}{3} + \frac{2 y_1}{3 y_T},$$

assuming that $y_T > 0$; otherwise, $P(1 \in s \mid y, \theta) = 0$.

Thus, for each i = 1,...,4 we have that

$$P(i \in s \mid y, \theta) = \begin{cases} \dfrac{1}{3} + \dfrac{2 y_i}{3 y_T}, & y_T > 0 \\ 0, & y_T = 0. \end{cases}$$

As a check, we may ask whether the sum of these inclusion probabilities
equals n = 2.

The answer is yes, assuming that y is such that $y_T > 0$; in that case,

$$\sum_{i=1}^{4} P(i \in s \mid y, \theta) = \sum_{i=1}^{4} \left( \frac{1}{3} + \frac{2 y_i}{3 y_T} \right) = \frac{4}{3} + \frac{2 y_T}{3 y_T} = \frac{4}{3} + \frac{2}{3} = 2 = n.$$
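
As a further check, the closed-form inclusion probability can be compared with a brute-force sum of the sampling density over all samples that contain unit i. This is a sketch only; the function names and the example vector y = (1,0,1,1) are assumptions made for illustration.

```r
# Closed form: P(i in s | y) = 1/3 + 2*y_i/(3*yT) when yT > 0, and 0 otherwise
incl_prob <- function(y) {
  yT <- sum(y)
  if (yT == 0) return(rep(0, length(y)))
  1/3 + 2 * y / (3 * yT)
}

# Brute force: add up f(s | y) over the samples s that contain unit i
brute <- function(y) {
  S <- t(combn(4, 2))
  f <- (y[S[, 1]] + y[S[, 2]]) / (3 * sum(y))
  sapply(1:4, function(i) sum(f[S[, 1] == i | S[, 2] == i]))
}

y <- c(1, 0, 1, 1)           # hypothetical population vector with yT = 3
rbind(incl_prob(y), brute(y))
sum(incl_prob(y))            # equals n = 2
```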


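(ii) Since $P(i \in s \mid y, \theta)$ does not depend on $\theta$, we also have

$$P(i \in s \mid y) = \begin{cases} \dfrac{1}{3} + \dfrac{2 y_i}{3 y_T}, & y_T > 0 \\ 0, & y_T = 0. \end{cases}$$

(iii) The probability of selecting unit i into the sample given $\theta$ is the same
for all i, in particular i = 1, and so may be written

$$P(i \in s \mid \theta) = P(1 \in s \mid \theta) = \sum_{y} P(1 \in s \mid \theta, y)\, f(y \mid \theta)$$

$$= 0 \times P(y = (0,0,0,0) \mid \theta) + \sum_{y:\, y_T > 0} \left( \frac{1}{3} + \frac{2 y_1}{3 y_T} \right) \prod_{i=1}^{4} \theta^{y_i} (1-\theta)^{1-y_i}$$

$$= \sum_{y:\, y_T > 0} \left( \frac{1}{3} + \frac{2 y_1}{3 y_T} \right) \theta^{y_T} (1-\theta)^{4 - y_T} = \begin{cases} 0.34180, & \theta = 1/4 \\ 0.46875, & \theta = 1/2. \end{cases}$$

These numbers were obtained by writing and implementing a suitable
function in R (see the R code below).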
(iv) The unconditional probability that any particular population unit i will
be selected into the sample is

$$P(i \in s) = \sum_{\theta} P(i \in s \mid \theta)\, f(\theta) = 0.34180 \times \frac{1}{2} + 0.46875 \times \frac{1}{2} = 0.40527.$$

To check this result, we note that the sum of inclusion probabilities
should in this case be identical to the expected sample size.

The first of these quantities is $\sum_{i=1}^{4} P(i \in s) = 4 \times 0.40527 = 1.6211$.

The second of these quantities can be obtained by first noting that

$$P(y_T = 0 \mid \theta) = (1-\theta)^4 = \begin{cases} (3/4)^4 = 81/256, & \theta = 1/4 \\ (2/4)^4 = 16/256, & \theta = 2/4. \end{cases}$$

This implies that

$$P(y_T = 0) = \sum_{\theta} P(y_T = 0 \mid \theta)\, f(\theta) = \frac{81}{256} \times \frac{1}{2} + \frac{16}{256} \times \frac{1}{2} = \frac{97}{512} = 0.18945.$$

The sample vector has size 2 if $y_T > 0$, and size 0 if $y_T = 0$. So its
expected size is $0 \times 0.18945 + 2 \times (1 - 0.18945) = 1.6211$, which is the
same as $\sum_{i=1}^{4} P(i \in s)$ above.


R Code for Exercise 9.2

# (a) & (b)

options(digits=5)

# Kernel of the joint density as a function of theta and the nonsample vector yr
kern=function(th,yr){ th^(2+sum(yr))*(1-th)^(2-sum(yr))/(2+sum(yr)) }

kernth0.25 = kern(th=0.25,yr=c(0,0))+ kern(th=0.25,yr=c(0,1))+
    kern(th=0.25,yr=c(1,0))+ kern(th=0.25,yr=c(1,1))

kernth0.5 = kern(th=0.5,yr=c(0,0))+ kern(th=0.5,yr=c(0,1))+
    kern(th=0.5,yr=c(1,0))+ kern(th=0.5,yr=c(1,1))

postth=c(kernth0.25, kernth0.5)/( kernth0.25 + kernth0.5)
postth # 0.22946 0.77054

kernyr00 = kern(th=0.25,yr=c(0,0))+ kern(th=0.5,yr=c(0,0))
kernyr01 = kern(th=0.25,yr=c(0,1))+ kern(th=0.5,yr=c(0,1))
kernyr10 = kern(th=0.25,yr=c(1,0))+ kern(th=0.5,yr=c(1,0))
kernyr11 = kern(th=0.25,yr=c(1,1))+ kern(th=0.5,yr=c(1,1))

postyr = c(kernyr00,kernyr01,kernyr10,kernyr11)/
    (kernyr00+kernyr01+kernyr10+kernyr11)

postyr # 0.42493 0.21530 0.21530 0.14448

# (c)

sum(c(9/25,3/19,3/19,1/17)*postyr) # 0.22946 Correct
sum(c(16/25,16/19,16/19,16/17)*postyr) # 0.77054 Correct

# (d)

# Contribution of one population vector y to P(1 in s | theta)
probfun=function(y,th){ yT=sum(y); res=0
    if(yT>0) res = ((1/3) + (2/3)*y[1]/yT) * th^yT * (1-th)^(4-yT)
    res }

mat1=matrix(c(0,0,0, 0,0,1, 0,1,0, 1,0,0, 0,1,1, 1,0,1, 1,1,0, 1,1,1),
    byrow=T, nrow=8,ncol=3)

# All 16 possible population vectors, one per row
mat2=rbind(mat1,mat1); ymat=cbind(c(rep(0,8),rep(1,8)), mat2)

ymat

# [1,] 0 0 0 0
# [2,] 0 0 0 1
# ...
# [15,] 1 1 1 0
# [16,] 1 1 1 1

prob0.25=0; for(i in 1:16) prob0.25 = prob0.25 + probfun(y=ymat[i,],th=0.25)
prob0.5=0; for(i in 1:16) prob0.5 = prob0.5 + probfun(y=ymat[i,],th=0.5)

c(prob0.25,prob0.5) # 0.34180 0.46875
(prob0.25+prob0.5)/2 # 0.40527
4*(prob0.25+prob0.5)/2 # 1.6211
c(97/512, 2*(1-97/512) ) # 0.18945 1.62109
Exercise 9.3 A finite population Bayesian model with SRSWOR

We sample n = 2 units from a finite population of N = 4 via SRSWOR.
If $\theta = 0$ then the finite population vector y is equally likely to be each of
the following:

(0,0,0,0), (0,0,0,1), (0,0,1,1), (0,1,1,1).

If $\theta = 1$ then the finite population vector y is equally likely to be each of
the following:

(1,1,1,1), (1,1,1,0), (1,1,0,0), (1,0,0,0).

A priori, the parameter $\theta$ is equally likely to be 0 or 1 (e.g. according to
the toss of a coin).

Suppose we sample units 2 and 3, with values 1 and 1, respectively.

(a) Find the posterior distribution of $\theta$.

(b) Find the predictive distribution of the finite population mean, namely
$\bar{y} = (y_1 + \cdots + y_N)/N$.


Solution to Exercise 9.3 


The easiest way to do this exercise is to first identify eight equally likely
possibilities to start with. These possibilities are:

   1. $\theta = 0$, y = (0,0,0,0) with $\bar{y} = 0$
   2. $\theta = 0$, y = (0,0,0,1) with $\bar{y} = 1/4$
   3. $\theta = 0$, y = (0,0,1,1) with $\bar{y} = 1/2$
→ 4. $\theta = 0$, y = (0,1,1,1) with $\bar{y} = 3/4$
→ 5. $\theta = 1$, y = (1,1,1,1) with $\bar{y} = 1$
→ 6. $\theta = 1$, y = (1,1,1,0) with $\bar{y} = 3/4$
   7. $\theta = 1$, y = (1,1,0,0) with $\bar{y} = 1/2$
   8. $\theta = 1$, y = (1,0,0,0) with $\bar{y} = 1/4$.

After observing $y_s = (y_2, y_3) = (1,1)$, there are only three possibilities
remaining (4, 5 and 6 in the list, each highlighted by an arrow).

(a) Two possibilities out of the 3 correspond to $\theta = 1$ (namely 5 and 6)
and one to $\theta = 0$ (namely 4); consequently,

$$f(\theta \mid D) = \begin{cases} 1/3, & \theta = 0 \\ 2/3, & \theta = 1; \end{cases}$$

equivalently, $(\theta \mid D) \sim \text{Bern}(2/3)$.

(b) Two possibilities out of the 3 correspond to $\bar{y} = 3/4$ (namely 4 and
6) and one to $\bar{y} = 1$ (namely 5); therefore

$$f(\bar{y} \mid D) = \begin{cases} 2/3, & \bar{y} = 3/4 \\ 1/3, & \bar{y} = 1. \end{cases}$$
Alternative solution 


The above results can also be obtained by working through in the style of
the solutions to previous exercises, as follows. Before the data is observed,
the Bayesian model may be written:

$$f(s \mid y, \theta) = \binom{N}{n}^{-1} = \binom{4}{2}^{-1} = \frac{1}{6}, \quad s = (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)$$

$$f(y \mid \theta) = \frac{1}{4}, \quad y = (\theta,\theta,\theta,\theta),\ (\theta,\theta,\theta,1-\theta),\ (\theta,\theta,1-\theta,1-\theta),\ (\theta,1-\theta,1-\theta,1-\theta)$$

$$f(\theta) = 1/2, \quad \theta = 0, 1$$
(the prior density of the parameter).

The observed data is $D = (s, y_s) = ((2,3), (1,1))$. At this particular value of
the data:

$f(s \mid y, \theta) = 1/6$, s = (2,3) (the value of s actually observed)

$f(y \mid \theta) = 1/4$, for y = (0,1,1,1) and $\theta = 0$, or
$y \in \{(1,1,1,1), (1,1,1,0)\}$ and $\theta = 1$ (where we need
only consider values of y consistent with the data)

$f(\theta) = 1/2$, $\theta = 0, 1$ (since both values of $\theta$ are still possible,
i.e. consistent with the observed data).

With the quantities s = (2,3), $y_s = (y_2, y_3) = (1,1)$ and $y_r = (y_1, y_4)$
understood in this way, the joint density of all quantities in the model may
be written

$$f(\theta, s, y) = f(\theta, s, y_s, y_r) = f(\theta)\, f(y_s, y_r \mid \theta)\, f(s \mid y_s, y_r, \theta)$$

$$= \frac{I(\theta \in \{0,1\})}{2} \times \frac{I(y = (0,1,1,1), \theta = 0) + I(y \in \{(1,1,1,1), (1,1,1,0)\}, \theta = 1)}{4} \times \frac{1}{6}$$

$$\propto I(y_r = (0,1), \theta = 0) + I(y_r \in \{(1,1), (1,0)\}, \theta = 1).$$

(a) It follows that

$$f(\theta \mid D) \propto f(\theta, s, y_s) = \sum_{y_r} f(\theta, s, y) \propto \begin{cases} \sum_{y_r} I(y_r = (0,1)) = 1, & \theta = 0 \\ \sum_{y_r} I(y_r \in \{(1,1), (1,0)\}) = 2, & \theta = 1. \end{cases}$$

After normalising, we see that

$$f(\theta \mid D) = \begin{cases} 1/3, & \theta = 0 \\ 2/3, & \theta = 1. \end{cases}$$

(b) Also,

$$f(y_r \mid D) \propto f(y_r, s, y_s) = \sum_{\theta} f(\theta, s, y_s, y_r) \propto \sum_{\theta} \left[ I(y_r = (0,1), \theta = 0) + I(y_r \in \{(1,1), (1,0)\}, \theta = 1) \right] = \begin{cases} 1, & y_r = (0,1) \\ 1, & y_r = (1,1) \\ 1, & y_r = (1,0), \end{cases}$$

which implies that $f(y_r \mid D) = 1/3$, $y_r = (0,1), (1,1), (1,0)$.
Consequently, $f(y \mid D) = 1/3$, y = (0,1,1,1), (1,1,1,1), (1,1,1,0).

Now, the values of y listed here as possible given the observed data have
means 3/4, 1 and 3/4, respectively.

It follows that the predictive density of the population mean is

$$f(\bar{y} \mid D) = \begin{cases} 2/3, & \bar{y} = 3/4 \\ 1/3, & \bar{y} = 1 \end{cases}$$

(as was obtained previously).
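
The counting argument in this solution can be reproduced by direct enumeration. The sketch below is illustrative (variable names are assumptions); it relies on the fact that, because SRSWOR is ignorable, the eight combinations that survive the data remain equally likely.

```r
# The eight equally likely (theta, y) combinations
theta <- rep(c(0, 1), each = 4)
ymat <- rbind(c(0,0,0,0), c(0,0,0,1), c(0,0,1,1), c(0,1,1,1),
              c(1,1,1,1), c(1,1,1,0), c(1,1,0,0), c(1,0,0,0))

# Keep only the combinations consistent with the data y2 = 1 and y3 = 1
keep <- ymat[, 2] == 1 & ymat[, 3] == 1
which(keep)                                               # rows 4, 5 and 6

post_theta <- table(theta[keep]) / sum(keep)              # (a) 1/3 at theta = 0, 2/3 at theta = 1
pred_mean <- table(rowMeans(ymat[keep, ])) / sum(keep)    # (b) 2/3 at 0.75, 1/3 at 1
post_theta
pred_mean
```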
Exercise 9.4 Length-biased with-replacement sampling from a
Poisson finite population


A finite population of size 9 consists of values that are conditionally iid 
Poisson with a mean whose prior distribution is gamma with both 
parameters zero (considered uninformative). 


We sample 3 times from the finite population according to a with- 
replacement sampling scheme, where on each draw the probability of 
selecting a unit is proportional to its value. 


Unit 2 is selected once and its value is 1. 
Unit 4 is selected twice and its value is 3. 


Find the posterior distribution of the Poisson mean and the predictive 
distribution of the nonsample total. 


Also find these distributions under the (false) assumption that the 
sampling is SRSWR. 


Then create two plots which suitably compare the four distributions 
indicated above. 


Note: The concepts here involve a biased sampling mechanism and are 
relevant to on-site sampling, where for example we wish to estimate the 
total number of times that visitors (or potential visitors) to a recreational 
park actually visit there in some specified time period. 


If we go to the site at random times to survey visitors, we are more likely 
to interview people who come very often relative to those who come 
only rarely. This means that we may end up over-estimating the 
popularity of the park—unless we make a suitable correction 
(downwards) to account for the (upwardly) biased sampling mechanism. 
If a potential visitor to the site doesn't come at all, then there is zero 
chance of sampling them. 


If we wish to consider only the population of persons who actually visit 
the site in a given period (i.e. to exclude the potential visitors who do 
not visit), we may need to consider a truncated model involving the 
Poisson random variable conditional on it being non-zero. For further 
details and a discussion of the modelling issues here, see Shaw (1988). 
Solution to Exercise 9.4

Generally, we are considering a sample of size n obtained with
replacement from a finite population with values $y_1, \ldots, y_N$ which are
conditionally Poisson with some mean $\lambda$, where the prior distribution on
$\lambda$ is gamma with parameters $\eta$ and $\tau$ (and mean $\eta/\tau$).

Let $I_i$ be the number of times population unit i is sampled and define
$I = (I_1, \ldots, I_N)$. Then let $d = \sum_{i=1}^{N} I(I_i > 0)$ be the distinct sample size (the
number of distinct population units sampled), and let m = N − d be the
nonsample size (the number of units not sampled).

In this scenario, we define the sample vector as $y_s = (y_{s_1}, \ldots, y_{s_d})$, where
$s = (s_1, \ldots, s_d)$ is the vector of the labels of the d distinct units that are
sampled, and we define the nonsample vector as $y_r = (y_{r_1}, \ldots, y_{r_m})$, where
$r = (r_1, \ldots, r_m)$ is the vector of the labels of the m units that are not sampled.

Note: Here, s is a function of I, and so the data in this situation could
also be written as $D = (I, y_s)$.

Since we are interested in the nonsample values only by way of their total
$y_{rT}$, a suitable Bayesian finite population model in this context is:

$$f(I \mid y, \lambda) = \frac{n!}{\prod_{i=1}^{N} I_i!} \prod_{i=1}^{N} \left( \frac{y_i}{y_T} \right)^{I_i}, \quad I \in \{(a_1, \ldots, a_N) : a_i \in \{0, 1, \ldots, n\}\ \forall i,\ a_1 + \cdots + a_N = n\}$$

$$f(y_s, y_{rT} \mid \lambda) = \left( \prod_{i \in s} \frac{e^{-\lambda} \lambda^{y_i}}{y_i!} \right) \times \frac{e^{-m\lambda} (m\lambda)^{y_{rT}}}{y_{rT}!}$$

$$\lambda \sim G(\eta, \tau).$$

In our specific situation, N = 9, n = 3, and the data is

$D = (I, y_s) = ((0,1,0,2,0,0,0,0,0), (1,3))$,

meaning that unit 2 is selected once and its value is 1, and unit 4 is
selected twice and its value is 3. Thus d = 2 and m = 7. Also, $\eta = 0$ and
$\tau = 0$.
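On the basis of these specifications, we wish to make inferences about $\lambda$
and the nonsample total,
$y_{rT} = y_1 + y_3 + y_5 + y_6 + y_7 + y_8 + y_9$.

Note: The probability of sampling unit 2 once and unit 4 twice (as is
assumed to have occurred) equals

$$\frac{y_2}{y_T} \frac{y_4}{y_T} \frac{y_4}{y_T} + \frac{y_4}{y_T} \frac{y_2}{y_T} \frac{y_4}{y_T} + \frac{y_4}{y_T} \frac{y_4}{y_T} \frac{y_2}{y_T} = \frac{3!}{1!\, 2!} \left( \frac{y_2}{y_T} \right)^{1} \left( \frac{y_4}{y_T} \right)^{2}$$

and so is consistent with

$$f(I \mid y, \lambda) = \frac{n!}{\prod_{i=1}^{N} I_i!} \prod_{i=1}^{N} \left( \frac{y_i}{y_T} \right)^{I_i}$$

as specified in the general model.

For this exercise we will first derive the predictive distribution of $y_{rT}$ and
then use this to obtain the posterior distribution of $\lambda$ only afterwards. The
predictive density of $y_{rT}$ is

$$f(y_{rT} \mid D) \propto \int f(\lambda, I, y_s, y_{rT})\, d\lambda = \int f(\lambda)\, f(y_s, y_{rT} \mid \lambda)\, f(I \mid y_s, y_{rT}, \lambda)\, d\lambda$$

$$\propto \int_0^{\infty} \lambda^{\eta - 1} e^{-\tau\lambda} \times \left( \prod_{i \in s} e^{-\lambda} \lambda^{y_i} \right) \times \frac{e^{-m\lambda} (m\lambda)^{y_{rT}}}{y_{rT}!} \times \frac{1}{y_T^{\,n}}\, d\lambda$$

(note that $\prod_{i=1}^{N} (1/y_T)^{I_i} = (1/y_T)^n$)

$$= \frac{m^{y_{rT}}}{y_{rT}!\, y_T^{\,n}} \int_0^{\infty} \lambda^{\eta + y_{sT} + y_{rT} - 1} e^{-(\tau + d + m)\lambda}\, d\lambda$$

$$= \frac{m^{y_{rT}}}{y_{rT}!\, (y_{sT} + y_{rT})^n} \times \frac{\Gamma(\eta + y_{sT} + y_{rT})}{(\tau + d + m)^{\eta + y_{sT} + y_{rT}}}.$$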
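Thus

$$f(y_{rT} \mid D) = \frac{k(y_{rT})}{c}, \quad y_{rT} = 0, 1, 2, \ldots,$$

where

$$k(y_{rT}) = \left( \frac{N - d}{N + \tau} \right)^{y_{rT}} \frac{\Gamma(\eta + y_{sT} + y_{rT})}{y_{rT}!\, (y_{sT} + y_{rT})^n}$$

and

$$c = \sum_{y_{rT} = 0}^{\infty} k(y_{rT}).$$

Note: Here, d + m = N, and so

$$\frac{m^{y_{rT}}}{(\tau + d + m)^{\eta + y_{sT} + y_{rT}}} = \frac{(N - d)^{y_{rT}}}{(N + \tau)^{\eta + y_{sT} + y_{rT}}} \propto \left( \frac{N - d}{N + \tau} \right)^{y_{rT}}.$$

We may approximate $f(y_{rT} \mid D)$ by calculating $k(y_{rT})$ only for
$y_{rT} = 0, 1, 2, \ldots, M$, for some large integer M (in practice we used 100) for
which $k(y_{rT})$ is sufficiently close to zero.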


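Using the predictive density of $y_{rT}$, we can now obtain the posterior
density of $\lambda$ as

$$f(\lambda \mid D) = \sum_{y_{rT} = 0}^{\infty} f(\lambda \mid D, y_{rT})\, f(y_{rT} \mid D),$$

where

$$(\lambda \mid D, y_{rT}) \sim G(\eta + y_{sT} + y_{rT},\ \tau + N).$$

Note: This result is obvious but can also be obtained as follows:

$$f(\lambda \mid D, y_{rT}) = f(\lambda \mid I, y_s, y_{rT}) \propto f(\lambda, I, y_s, y_{rT}) \propto f(\lambda)\, f(y_s, y_{rT} \mid \lambda)$$

$$\propto \lambda^{\eta - 1} e^{-\tau\lambda} \times \left( \prod_{i \in s} \frac{e^{-\lambda} \lambda^{y_i}}{y_i!} \right) \times \frac{e^{-m\lambda} (m\lambda)^{y_{rT}}}{y_{rT}!}$$

$$\propto \lambda^{\eta - 1} e^{-\tau\lambda} \times e^{-d\lambda} \lambda^{y_{sT}} \times e^{-m\lambda} \lambda^{y_{rT}}$$

$$= \lambda^{\eta + y_{sT} + y_{rT} - 1} e^{-(\tau + N)\lambda}$$ (since d + m = N).

We see that $f(\lambda \mid D)$ is an infinite mixture of gamma densities where the
weight assigned to each one is the corresponding (marginal) predictive
density of $y_{rT}$.

Note: An alternative way to derive $f(\lambda \mid D)$ is using the equation

$$f(\lambda \mid D) \propto \sum_{y_{rT}} f(\lambda, I, y_s, y_{rT}).$$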
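The case of SRSWR

In the case of SRSWR, the sampling density

$$f(I \mid y, \lambda) = \frac{n!}{\prod_{i=1}^{N} I_i!} \prod_{i=1}^{N} \left( \frac{y_i}{y_T} \right)^{I_i}$$

changes to

$$f(I \mid y, \lambda) = \frac{n!}{\prod_{i=1}^{N} I_i!} \prod_{i=1}^{N} \left( \frac{1}{N} \right)^{I_i} = \frac{n!}{N^n \prod_{i=1}^{N} I_i!},$$

which we note does not depend on $\lambda$ or $y_{rT}$, and so can be ‘ignored’.

The result is then almost the same as before, the only difference being that
the term

$$\prod_{i=1}^{N} y_T^{\,I_i} = y_T^{\,n} = (y_{sT} + y_{rT})^n$$

in

$$k(y_{rT}) = \left( \frac{N - d}{N + \tau} \right)^{y_{rT}} \frac{\Gamma(\eta + y_{sT} + y_{rT})}{y_{rT}!\, (y_{sT} + y_{rT})^n}$$

is replaced by 1.

Thus under SRSWR we find that

$$f(y_{rT} \mid D) = \frac{K(y_{rT})}{C},$$

where

$$K(y_{rT}) = \left( \frac{N - d}{N + \tau} \right)^{y_{rT}} \frac{\Gamma(\eta + y_{sT} + y_{rT})}{y_{rT}! \times 1}$$

and

$$C = \sum_{y_{rT} = 0}^{\infty} K(y_{rT}).$$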
As regards the posterior distribution of $\lambda$ under SRSWR, this need no
longer be expressed as an infinite mixture of gamma distributions but
simply as

$$(\lambda \mid D) \sim G(\eta + y_{sT},\ \tau + d).$$

Figure 9.1 shows the posterior density $f(\lambda \mid D)$ under the length-biased
and SRSWR assumptions, respectively.

We see that the inference under the assumption of length-bias is the lower
of the two. This is because it appropriately corrects for large finite
population values being more likely to be selected. If we ‘ignore’ the fact
that large values are more likely to be selected, then we will erroneously
over-estimate the superpopulation mean, $\lambda$.

Figure 9.2 shows the predictive density $f(y_{rT} \mid D)$, again under the two
assumptions.

As in Figure 9.1, we see that ignoring the length-biased sampling
mechanism tends to bias the inference upwards.

As a check on our calculations, which omitted all terms corresponding to
values of $y_{rT}$ greater than M = 100 (see above), we calculate the
predictive mean of $y_{rT}$ under the SRSWR assumption using the formula

$$E(y_{rT} \mid D) \approx \frac{1}{C} \sum_{y_{rT} = 0}^{M} y_{rT}\, K(y_{rT})$$

and obtain the value of 14.

This may be compared with the theoretical value, which is exactly

$$E(y_{rT} \mid D) = E\{E(y_{rT} \mid D, \lambda) \mid D\} = E(m\lambda \mid D) = m \left( \frac{\eta + y_{sT}}{\tau + d} \right) = 7 \left( \frac{0 + (3 + 1)}{0 + 2} \right) = 14.$$
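
An additional check, not made in the original text, is possible because with $\eta = \tau = 0$ the SRSWR kernel $K(y_{rT}) = (m/N)^{y_{rT}}\, \Gamma(y_{sT} + y_{rT}) / y_{rT}!$ is proportional to a negative binomial probability function with size $y_{sT} = 4$ and success probability $d/N = 2/9$. The sketch below compares the two using R's dnbinom; the variable names are illustrative assumptions.

```r
N <- 9; d <- 2; m <- N - d; ysT <- 4; eta <- 0; tau <- 0
yrTv <- 0:100                                   # truncation at M = 100, as in the text

# Predictive density under the SRSWR assumption, from the kernel K
K <- ((N - d) / (tau + N))^yrTv * gamma(eta + ysT + yrTv) / factorial(yrTv)
f_ignore <- K / sum(K)

# Negative binomial pmf with size = ysT and prob = d/N, renormalised on 0:100
nb <- dnbinom(yrTv, size = ysT, prob = d / N)
max(abs(f_ignore - nb / sum(nb)))               # essentially zero

sum(yrTv * f_ignore)                            # approximately 14
m * (eta + ysT) / (tau + d)                     # exactly 14
```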
Figure 9.1 Posterior densities of the Poisson mean

Figure 9.2 Predictive densities of the nonsample total
R Code for Exercise 9.4

options(digits=5); X11(w=8,h=5); par(mfrow=c(1,1))

N=9; n = 3; ys=c(1,3); ysT = sum(ys); d = 2; m = 7; eta=0; tau=0; yrTv=0:100

# Predictive density of the nonsample total under length-biased sampling
kv = ((N-d)/(tau+N))^yrTv * gamma(eta+ysT+yrTv)/
    ( factorial(yrTv) * (ysT+yrTv)^n )
c = sum(kv); fv = kv/c

plot(yrTv,fv,pch=16, xlab="nonsample total",
    ylab="predictive density",xlim=c(0,60), main=" ")

# Predictive density under the (false) SRSWR assumption
kvigno = ((N-d)/(tau+N))^yrTv * gamma(eta+ysT+yrTv)/( gamma(yrTv+1) * 1)
cigno = sum(kvigno); fvigno = kvigno/cigno
points(yrTv,fvigno,pch=1)
legend(20,0.1,c("Length-bias assumed (Inference is correct)",
    "SRSWR assumed (Inference is too high)"),pch=c(16,1))
c(sum(yrTv*fv), sum(yrTv*fvigno) ) # 5.6302 14.0000
m*(eta+ysT)/(tau+d) # 14

# Posterior density of lambda as a mixture of gamma densities
lamv=seq(0,10,0.01); lamfv=lamv
for(i in 1:length(lamv)) lamfv[i]=sum(fv*dgamma(lamv[i],eta+ysT+yrTv,tau+N))
plot(lamv,lamfv,type="l", lty=1, lwd=3,
    xlab="lambda", ylab="posterior density", main=" ")
lamfvigno=lamv
for(i in 1:length(lamv))
    lamfvigno[i]=sum(fvigno*dgamma(lamv[i],eta+ysT+yrTv,tau+N))
# lines(lamv,lamfvigno,lty=2,lwd=1) # Can do as a check on calculations
lines(lamv,dgamma(lamv,eta+ysT,tau+d),lty=2,lwd=3)
legend(4,0.5,c("Length-bias assumed (Inference is correct)",
    "SRSWR assumed (Inference is too high)"),lty=c(1,2),lwd=c(3,3))
Exercise 9.5 An exponential finite population model with a
biased Poisson sampling scheme

A sample is drawn from a finite population of size N = 7 in such a way
that unit i has probability of inclusion $\pi_i$, independently of all the other
units.

The values in the finite population are independent and identically
distributed exponentials with mean $\mu = 1/\lambda$, where the prior distribution
for $\lambda$ is given by

$f(\lambda) \propto 1/\lambda$, $\lambda > 0$.

Units 3 and 5 are selected, and their values are 1.6 and 0.4, respectively.

Find and sketch the posterior density of the superpopulation mean $\mu$ and
the predictive density of the finite population mean $\bar{y}$ under each of the
following specifications:

(a) All the $\pi_i$ values are equal to 0.3 (i = 1,...,N).

(b) All the $\pi_i$ values are equal to 0.3 except that:
$\pi_5 = 0.3$ if $y_5 < 1$
$\pi_5 = 0.9$ if $y_5 > 1$
(thus unit 5 is 3 times as likely to be sampled if its value exceeds 1).

(c) All the $\pi_i$ values are equal to 0.3 except that:
$\pi_4 = 0.3$ if $y_4 < 1$
$\pi_4 = 0.9$ if $y_4 > 1$
(thus unit 4 is 3 times as likely to be sampled if its value exceeds 1).

Note: Here, the sample size n is not fixed and is a random variable.
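Solution to Exercise 9.5

(a) The relevant Bayesian model is:

$$f(I \mid y, \lambda) = \prod_{i=1}^{N} \pi_i^{I_i} (1 - \pi_i)^{1 - I_i}, \quad I_i \in \{0, 1\}$$

$$f(y \mid \lambda) = \prod_{i=1}^{N} \lambda e^{-\lambda y_i}, \quad y_i > 0\ \forall i$$

$$f(\lambda) \propto 1/\lambda, \quad \lambda > 0.$$

Here,

$\pi_1 = \cdots = \pi_N = 0.3$,

and the data is

$D = (I, y_s) = ((0,0,1,0,1,0,0), (1.6, 0.4))$,

with n = 2 (the achieved sample size).

The sampling mechanism is ignorable and so

$$f(\lambda \mid D) \propto f(\lambda)\, f(y_s \mid \lambda) \propto \frac{1}{\lambda} \prod_{i \in s} \lambda e^{-\lambda y_i} = \lambda^{n-1} e^{-\lambda y_{sT}}$$

$$\Rightarrow (\lambda \mid D) \sim G(n, y_{sT}) \ \Rightarrow\ (\mu \mid D) \sim IG(n, y_{sT}).$$

Next,

$(y_{rT} \mid \lambda) \sim G(m, \lambda)$,

where m = N − n = 7 − 2 = 5.

It follows that

$$f(y_{rT} \mid D) = \int f(y_{rT} \mid D, \lambda)\, f(\lambda \mid D)\, d\lambda \propto \int_0^{\infty} \lambda^m y_{rT}^{\,m-1} e^{-\lambda y_{rT}} \times \lambda^{n-1} e^{-\lambda y_{sT}}\, d\lambda$$

$$= y_{rT}^{\,m-1} \int_0^{\infty} \lambda^{n+m-1} e^{-\lambda (y_{rT} + y_{sT})}\, d\lambda = y_{rT}^{\,m-1} \frac{\Gamma(n + m)}{(y_{rT} + y_{sT})^{n+m}}$$

$$\propto \frac{y_{rT}^{\,m-1}}{(y_{rT} + y_{sT})^{n+m}}, \quad y_{rT} > 0.$$

Since $y_{sT} = n\bar{y}_s$ and $y_{rT} + y_{sT} = N\bar{y}$, the predictive density of the finite population mean is therefore

$$f(\bar{y} \mid D) \propto \frac{(N\bar{y} - n\bar{y}_s)^{m-1}}{(N\bar{y})^{n+m}}, \quad \bar{y} > n\bar{y}_s / N$$

(using the fact that $y_{rT} = N\bar{y} - n\bar{y}_s$).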
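
The results in part (a) can be evaluated numerically with a short R sketch (variable and function names are illustrative assumptions). The normalising constant of the predictive kernel is obtained with integrate and compared with the closed form $B(m, n) / y_{sT}^{\,n}$, which follows from the standard beta-prime integral and is an observation added here rather than a result stated in the text.

```r
n <- 2; ysT <- 1.6 + 0.4; N <- 7; m <- N - n

# Posterior of lambda is G(n, ysT), so its posterior mean is n / ysT
post_mean_lambda <- n / ysT

# Unnormalised predictive density of the nonsample total, y^(m-1) / (y + ysT)^(n+m),
# written in an algebraically equivalent form that avoids overflow for large y
kern <- function(y) (y / (y + ysT))^(m - 1) / (y + ysT)^(n + 1)
const <- integrate(kern, 0, Inf, rel.tol = 1e-10)$value

c(const, beta(m, n) / ysT^n)          # numerical and closed-form constants agree

pred <- function(y) kern(y) / const   # normalised predictive density of the nonsample total
pred(c(1, 5, 10))
```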
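(b) In this case, inferences will be exactly the same as in (a). This is
because, even though the sampling mechanism is potentially nonignorable
due to $f(I \mid y, \lambda)$ depending on a population value $y_5$, that value happens
to be known (since unit 5 is in the sample, i.e. $5 \in s$).

To clarify, we write

$$\pi_5 = \pi_5(y_5) = \begin{cases} 0.3, & y_5 < 1 \\ 0.9, & y_5 > 1 \end{cases} = 0.3 + 0.6\, I(y_5 > 1).$$

Then, noting that $I_5 = 1$ and $y_5 = 0.4$, we have that

$$f(I_5 \mid y, \lambda) = \pi_5^{I_5} (1 - \pi_5)^{1 - I_5} = \pi_5 = 0.3 + 0.6\, I(y_5 > 1) = 0.3.$$

Thus

$$f(I \mid y, \lambda) = \prod_{i=1}^{N} f(I_i \mid y, \lambda)$$

doesn't depend on $\lambda$ or $y_r$ and is completely known.

Therefore,

$$f(\lambda \mid D) \propto f(\lambda, I, y_s) = \int f(\lambda, I, y_s, y_r)\, dy_r = \int f(\lambda)\, f(y_s, y_r \mid \lambda)\, f(I \mid y_s, y_r, \lambda)\, dy_r$$

$$\propto f(\lambda)\, f(y_s \mid \lambda) \int f(y_r \mid \lambda)\, dy_r = f(\lambda)\, f(y_s \mid \lambda) \times 1,$$ as before in (a).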


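(c) In this case, the sampling mechanism is nonignorable and inferences
will be different to those in (a), because $f(I \mid y, \lambda)$ depends on a
population value $y_4$ which is unknown (since unit 4 is not in the sample,
i.e. $4 \in r$); that is, $f(I \mid y, \lambda)$ is unknown. To clarify, we write

$$\pi_4 = \pi_4(y_4) = \begin{cases} 0.3, & y_4 < 1 \\ 0.9, & y_4 > 1 \end{cases} = 0.3 + 0.6\, I(y_4 > 1).$$

Then, noting that $I_4 = 0$ and $y_4$ is unknown, we have that

$$f(I_4 \mid y, \lambda) = \pi_4^{I_4} (1 - \pi_4)^{1 - I_4} = 1 - \pi_4 = 0.7 - 0.6\, I(y_4 > 1)$$
(a function of $y_4$).

So $f(I \mid y, \lambda) = \prod_{i=1}^{N} f(I_i \mid y, \lambda)$ is unknown.

With this in mind, we now write

$$f(\lambda \mid D) \propto f(\lambda, I, y_s) = \int f(\lambda, I, y_s, y_r)\, dy_r = \int f(\lambda)\, f(y_s \mid \lambda)\, f(y_r \mid \lambda)\, f(I \mid y_s, y_r, \lambda)\, dy_r \propto f(\lambda)\, f(y_s \mid \lambda)\, W,$$

where

$$W = W(\lambda) = \int f(y_r \mid \lambda)\, f(I \mid y, \lambda)\, dy_r$$

$$\propto \left\{ \prod_{i \in r,\, i \neq 4} \int f(y_i \mid \lambda)\, dy_i \right\} \int f(y_4 \mid \lambda)\, f(I_4 \mid y_4, \lambda)\, dy_4$$

$$= \int_0^{\infty} \lambda e^{-\lambda y_4} \{0.7 - 0.6\, I(y_4 > 1)\}\, dy_4$$
since $f(y_i \mid \lambda) = \lambda e^{-\lambda y_i}$ integrates to 1 for each i

$$= 0.7 \int_0^{\infty} \lambda e^{-\lambda y_4}\, dy_4 - 0.6 \int_1^{\infty} \lambda e^{-\lambda y_4}\, dy_4 = 0.7 \times 1 - 0.6 e^{-\lambda}.$$

Thus

$$f(\lambda \mid D) \propto \lambda^{n-1} e^{-\lambda y_{sT}} (7 - 6 e^{-\lambda}) = 7 \lambda^{n-1} e^{-\lambda y_{sT}} - 6 \lambda^{n-1} e^{-\lambda (y_{sT} + 1)}.$$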
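Thus

$$f(\lambda \mid D) = \frac{1}{c} \left\{ 7 \frac{\Gamma(n)}{y_{sT}^{\,n}} f_{G(n,\, y_{sT})}(\lambda) - 6 \frac{\Gamma(n)}{(y_{sT} + 1)^n} f_{G(n,\, y_{sT} + 1)}(\lambda) \right\},$$

where

$$c = 7 \frac{\Gamma(n)}{y_{sT}^{\,n}} - 6 \frac{\Gamma(n)}{(y_{sT} + 1)^n}.$$

Note 1: The posterior $f(\lambda \mid D)$ is a weighted average of two gamma
densities where one of the weights is negative.

Note 2: The posterior density of $\mu = 1/\lambda$ is given by
$f(\mu \mid D) = f(\lambda = 1/\mu \mid D) / \mu^2$.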
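
To make Note 1 concrete, the two mixture weights can be computed for the data in this exercise (n = 2, $y_{sT}$ = 2). The following R sketch is illustrative only; the names c0, w1, w2 and post are assumptions. It normalises $7\lambda^{n-1} e^{-\lambda y_{sT}} - 6\lambda^{n-1} e^{-\lambda(y_{sT}+1)}$ using the gamma integral $\int_0^\infty \lambda^{n-1} e^{-b\lambda}\, d\lambda = \Gamma(n)/b^n$.

```r
n <- 2; ysT <- 2

# Normalising constant of 7*lambda^(n-1)*exp(-lambda*ysT) - 6*lambda^(n-1)*exp(-lambda*(ysT+1))
c0 <- 7 * gamma(n) / ysT^n - 6 * gamma(n) / (ysT + 1)^n

w1 <- (7 * gamma(n) / ysT^n) / c0            # weight on the G(n, ysT) density (exceeds 1)
w2 <- -(6 * gamma(n) / (ysT + 1)^n) / c0     # weight on the G(n, ysT + 1) density (negative)
c(w1, w2, w1 + w2)                           # the weights sum to 1

post <- function(lam) w1 * dgamma(lam, n, ysT) + w2 * dgamma(lam, n, ysT + 1)
integrate(post, 0, Inf)$value                # close to 1, as a density should be

post_mean <- w1 * n / ysT + w2 * n / (ysT + 1)   # posterior mean of lambda
post_mean
```

With these data the posterior mean of $\lambda$ is larger than the value n/$y_{sT}$ = 1 obtained in (a): not observing unit 4 makes a small value of $y_4$, and hence a larger $\lambda$, more plausible.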


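We now turn our attention to the predictive distribution of the nonsample
total. Observe that

$$f(y_{rT} \mid D) = \int f(y_{rT} \mid D, \lambda)\, f(\lambda \mid D)\, d\lambda,$$

where

$$f(y_r \mid D, \lambda) \propto f(y_r \mid \lambda)\, f(I \mid y, \lambda) \propto \{7 - 6\, I(y_4 > 1)\}\, \lambda e^{-\lambda y_4} \prod_{i \in r,\, i \neq 4} \lambda e^{-\lambda y_i}.$$

This suggests that we decompose the nonsample total according to

$y_{rT} = y_0 + y_4$

(where $y_0$ is the total of all values in $y_r$ except for $y_4$) and think about
how we can use the following facts:

$(y_0 \perp y_4 \mid D, \lambda)$ ($y_4$ is independent of all other nonsample
units, given D and $\lambda$)

$(y_0 \mid D, \lambda) \sim G(m - 1, \lambda)$ (a simple distribution)

$f(y_4 \mid D, \lambda) \propto [7 - 6\, I(y_4 > 1)]\, \lambda e^{-\lambda y_4}$, $y_4 > 0$
(a complicated distribution).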
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One strategy is to use these facts to obtain the cdf F(y,| D, A), hence 
F(y4|D,A), hence f(y4,|D,A4)2 F'(y4|D,A), hence f(y|D,A), 
and hence ultimately the required f (y| D) 2 f f (Y| DA) f (4| D)dA. 


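First,
$$F(y_4 \mid D, \lambda) \propto \begin{cases} 7 \int_0^{y_4} \lambda e^{-\lambda t}\, dt, & 0 < y_4 < 1 \\[4pt] 7 \int_0^{y_4} \lambda e^{-\lambda t}\, dt - 6 \int_1^{y_4} \lambda e^{-\lambda t}\, dt, & y_4 > 1 \end{cases}$$

$$= \begin{cases} 7(1 - e^{-\lambda y_4}), & 0 < y_4 < 1 \\ 7(1 - e^{-\lambda y_4}) - 6(e^{-\lambda} - e^{-\lambda y_4}), & y_4 > 1. \end{cases}$$

Thus
$$F(y_4 \mid D, \lambda) = \begin{cases} k(7 - 7e^{-\lambda y_4}), & 0 < y_4 < 1 \\ k(7 - e^{-\lambda y_4} - 6e^{-\lambda}), & y_4 > 1, \end{cases}$$

where $k = k(\lambda) = 1/(7 - 6e^{-\lambda})$, since $1 = F(y_4 = \infty \mid D, \lambda) = k(7 - 6e^{-\lambda})$.


Check: Since $(y_4 \mid D, \lambda)$ is continuous we would expect that
$$F(y_4 = 1^+ \mid D, \lambda) - F(y_4 = 1^- \mid D, \lambda) = 0.$$

The left hand side here is
$$k(7 - e^{-\lambda} - 6e^{-\lambda}) - k(7 - 7e^{-\lambda}) = k(7 - 7e^{-\lambda}) - k(7 - 7e^{-\lambda}) = 0$$
(which is correct).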


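Next, writing $a = y_{rT}$ for notational convenience, we have that
$$F(a \mid D, \lambda) = P(y_0 + y_4 \le a \mid D, \lambda)$$
$$= E\{ P(y_0 + y_4 \le a \mid D, \lambda, y_0) \mid D, \lambda \}$$
$$= \int_0^a P(y_4 \le a - y_0 \mid D, \lambda, y_0)\, f(y_0 \mid D, \lambda)\, dy_0$$
(a convolution).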
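For the case $a > 1$ we find that

$$F(a \mid D, \lambda) = \int_0^{a-1} k\big[7 - e^{-\lambda(a - y_0)} - 6e^{-\lambda}\big]\, f_{G(m-1,\lambda)}(y_0)\, dy_0 + \int_{a-1}^{a} k\big[7 - 7e^{-\lambda(a - y_0)}\big]\, f_{G(m-1,\lambda)}(y_0)\, dy_0.$$

Note: If $a > 1$ and $0 < y_0 < a - 1$ then $1 < a - y_0 < a$.
If $a > 1$ and $a - 1 < y_0 < a$ then $0 < a - y_0 < 1$.


Check: Since $F(a \mid D, \lambda)$ is continuous we would expect that
$$F(a = 1^+ \mid D, \lambda) - F(a = 1^- \mid D, \lambda) = 0.$$

The LHS here is
$$\int_0^{1-1} k\big[7 - e^{-\lambda(1 - y_0)} - 6e^{-\lambda}\big]\, f_{G(m-1,\lambda)}(y_0)\, dy_0 + \int_{1-1}^{1} k\big[7 - 7e^{-\lambda(1 - y_0)}\big]\, f_{G(m-1,\lambda)}(y_0)\, dy_0$$
$$- \int_0^1 k\big[7 - 7e^{-\lambda(1 - y_0)}\big]\, f_{G(m-1,\lambda)}(y_0)\, dy_0 = 0 \quad \text{(which is correct).}$$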


We now consider Leibniz's rule for differentiating an integral:

$$\frac{d}{dy} \int_{a(y)}^{b(y)} f(x, y)\, dx = \int_{a(y)}^{b(y)} \frac{\partial f(x, y)}{\partial y}\, dx + b'(y)\, f(b(y), y) - a'(y)\, f(a(y), y)$$
(where the symbols here are not directly related to those in this exercise).


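Applying this rule for the case $0 < a < 1$, we obtain
$$f(a \mid D, \lambda) = \frac{dF(a \mid D, \lambda)}{da} = \int_0^a k\big[0 - 7e^{-\lambda(a - y_0)}(-\lambda)\big]\, f_{G(m-1,\lambda)}(y_0)\, dy_0$$
$$\quad + \frac{da}{da}\, k\big[7 - 7e^{-\lambda(a - a)}\big]\, f_{G(m-1,\lambda)}(a) \quad \text{(this is zero)}$$
$$\quad - \frac{d0}{da}\, k\big[7 - 7e^{-\lambda(a - 0)}\big]\, f_{G(m-1,\lambda)}(0) \quad \text{(this is zero)}$$
$$= 7k\lambda \int_0^a e^{-\lambda(a - y_0)}\, \frac{\lambda^{m-1} y_0^{m-2} e^{-\lambda y_0}}{(m-2)!}\, dy_0 = 7k\lambda e^{-\lambda a}\, \frac{\lambda^{m-1}}{(m-2)!} \int_0^a y_0^{m-2}\, dy_0$$
$$= 7k\lambda e^{-\lambda a}\, \frac{\lambda^{m-1}}{(m-2)!} \cdot \frac{a^{m-1}}{m-1} = 7k\, \frac{\lambda^m a^{m-1} e^{-\lambda a}}{(m-1)!} = 7k\, f_{G(m,\lambda)}(a).$$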


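Likewise, for the case $a > 1$, we obtain

$$f(a \mid D, \lambda) = \frac{dF(a \mid D, \lambda)}{da} = \int_0^{a-1} k\big[0 - e^{-\lambda(a - y_0)}(-\lambda) - 0\big]\, f_{G(m-1,\lambda)}(y_0)\, dy_0$$
$$\quad + \frac{d(a-1)}{da}\, k\big[7 - e^{-\lambda(a - (a-1))} - 6e^{-\lambda}\big]\, f_{G(m-1,\lambda)}(a - 1)$$
$$\quad - \frac{d0}{da}\, k\big[7 - e^{-\lambda(a - 0)} - 6e^{-\lambda}\big]\, f_{G(m-1,\lambda)}(0) \quad \text{(this is zero)}$$
$$\quad + \int_{a-1}^{a} k\big[0 - 7e^{-\lambda(a - y_0)}(-\lambda)\big]\, f_{G(m-1,\lambda)}(y_0)\, dy_0$$
$$\quad + \frac{da}{da}\, k\big[7 - 7e^{-\lambda(a - a)}\big]\, f_{G(m-1,\lambda)}(a) \quad \text{(this is zero)}$$
$$\quad - \frac{d(a-1)}{da}\, k\big[7 - 7e^{-\lambda(a - (a-1))}\big]\, f_{G(m-1,\lambda)}(a - 1)$$

$$= k\lambda e^{-\lambda a}\, \frac{\lambda^{m-1}}{(m-2)!} \int_0^{a-1} y_0^{m-2}\, dy_0 + k\big[7 - 7e^{-\lambda}\big]\, f_{G(m-1,\lambda)}(a - 1)$$
$$\quad + 7k\lambda e^{-\lambda a}\, \frac{\lambda^{m-1}}{(m-2)!} \int_{a-1}^{a} y_0^{m-2}\, dy_0 - 7k\big[1 - e^{-\lambda}\big]\, f_{G(m-1,\lambda)}(a - 1)$$

$$= k\, \frac{\lambda^m (a-1)^{m-1} e^{-\lambda a}}{(m-1)!} + 7k\, \frac{\lambda^m \big[a^{m-1} - (a-1)^{m-1}\big] e^{-\lambda a}}{(m-1)!}$$

$$= 7k\, \frac{\lambda^m a^{m-1} e^{-\lambda a}}{(m-1)!} - 6k\, e^{-\lambda}\, \frac{\lambda^m (a-1)^{m-1} e^{-\lambda (a-1)}}{(m-1)!}$$

$$= k \times \big\{ 7 f_{G(m,\lambda)}(a) - 6e^{-\lambda} f_{G(m,\lambda)}(a - 1) \big\}.$$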
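In summary so far,

$$f(a \mid D, \lambda) = k \times \begin{cases} 7 f_{G(m,\lambda)}(a), & 0 < a < 1 \\ 7 f_{G(m,\lambda)}(a) - 6e^{-\lambda} f_{G(m,\lambda)}(a - 1), & a > 1. \end{cases}$$

Check: Here,
$$\int_0^\infty f(a \mid D, \lambda)\, da = k \times \Big\{ 7 F_{G(m,\lambda)}(1) + 7\big[1 - F_{G(m,\lambda)}(1)\big] - 6e^{-\lambda}\big[1 - F_{G(m,\lambda)}(1 - 1)\big] \Big\}$$
$$= \frac{1}{7 - 6e^{-\lambda}} \times \big(7 - 6e^{-\lambda}[1 - 0]\big) = 1$$
(which is correct).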


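Next, using the relationship $\bar{y} = (n\bar{y}_s + y_{rT})/N$, we obtain:

$$f(\bar{y} \mid D, \lambda) = f_1(\bar{y}, \lambda) \equiv N k(\lambda)\, 7 f_{G(m,\lambda)}(N\bar{y} - n\bar{y}_s) \quad \text{for } \frac{n\bar{y}_s}{N} < \bar{y} < \frac{n\bar{y}_s + 1}{N}$$

$$f(\bar{y} \mid D, \lambda) = f_2(\bar{y}, \lambda) \equiv N k(\lambda) \big[ 7 f_{G(m,\lambda)}(N\bar{y} - n\bar{y}_s) - 6e^{-\lambda} f_{G(m,\lambda)}(N\bar{y} - n\bar{y}_s - 1) \big] \quad \text{for } \bar{y} > \frac{n\bar{y}_s + 1}{N}$$

where:
$$\frac{n\bar{y}_s}{N} = 0.2857, \qquad \frac{n\bar{y}_s + 1}{N} = 0.4286, \qquad k(\lambda) = \frac{1}{7 - 6e^{-\lambda}} \;\text{(as before).}$$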
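Thus we finally obtain the required posterior predictive density:

$$f(\bar{y} \mid D) = g_1(\bar{y}) \equiv \int_0^\infty f_1(\bar{y}, \lambda)\, f(\lambda \mid D)\, d\lambda, \qquad \frac{n\bar{y}_s}{N} < \bar{y} < \frac{n\bar{y}_s + 1}{N}$$

$$f(\bar{y} \mid D) = g_2(\bar{y}) \equiv \int_0^\infty f_2(\bar{y}, \lambda)\, f(\lambda \mid D)\, d\lambda, \qquad \bar{y} > \frac{n\bar{y}_s + 1}{N},$$

where
$$f(\lambda \mid D) = \left[ \frac{7}{y_{sT}^{\,n}} - \frac{6}{(y_{sT} + 1)^n} \right]^{-1} \times \left[ \frac{7}{y_{sT}^{\,n}}\, f_{G(n,\, y_{sT})}(\lambda) - \frac{6}{(y_{sT} + 1)^n}\, f_{G(n,\, y_{sT} + 1)}(\lambda) \right]$$
(as obtained earlier).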


Figure 9.3 shows the two densities $f(\mu \mid D)$ and $f(\bar{y} \mid D)$ under each of
the scenarios in (a) and (c).


We see that inferences under the length-biased sampling scheme in (c) are 
lower than those under SRSWR in (a). This is because, generally 
speaking, length bias makes larger units more likely to be selected, and 
not adjusting for that bias naturally leads to inferences that are too high. 


These patterns are consistent with the following point estimates as 
obtained numerically (see the R code below for details of the calculation): 


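$E(\mu \mid D) = 1.38$ in (c) $<$ $E(\mu \mid D) = 2.00$ in (a)
$E(\bar{y} \mid D) = 1.19$ in (c) $<$ $E(\bar{y} \mid D) = 1.71$ in (a).


Note 1: In (a),
$$(\mu \mid D) \sim IG(n, y_{sT}),$$
and therefore
$$E(\mu \mid D) = y_{sT}/(n - 1) = 2/(2 - 1) = 2 \quad \text{(exactly).}$$


Note 2: The posterior predictive mean of $\bar{y}$ in (c) was obtained
numerically as follows:

$$\hat{\bar{y}} = E(\bar{y} \mid D) = \int_{n\bar{y}_s/N}^{(n\bar{y}_s + 1)/N} \bar{y}\, g_1(\bar{y})\, d\bar{y} + \int_{(n\bar{y}_s + 1)/N}^{\infty} \bar{y}\, g_2(\bar{y})\, d\bar{y}$$
$$= 0.01140 + 1.17546 = 1.1869.$$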
Figure 9.3 Posterior and predictive densities

[Figure: four curves, namely f(mu|D) under SRSWR in (a), f(mu|D) under length-biased sampling in (c), f(ybar|D) under SRSWR in (a), and f(ybar|D) under length-biased sampling in (c). The dotted vertical line shows the minimum possible value of ybar, which is (n*ysbar+0)/N = 0.286.]


R Code for Exercise 9.5 


# (a)

X11(w=8,h=4); par(mfrow=c(1,1))

# mu & ybar

N=7; ys=c(1.6,0.4); ysT=sum(ys); ysbar=mean(ys); n=length(ys); m=N-n

c(ysT,ysbar,n,m) # 2 1 2 5

# Posterior density of mu = 1/lambda, where (lambda|D) ~ G(n,ysT)
fmufun=function(mu,n,ysT) dgamma(1/mu,n,ysT)/mu^2
integrate(fmufun,0,Inf,n=n,ysT=ysT)$value # 1 check
muv=seq(0.0001,20.0001,0.005); fmuv= fmufun(muv,n=n,ysT=ysT)
plot(muv,fmuv,type="l",xlim=c(0,20)) # check
integrate(function(mu,n,ysT) mu*fmufun(mu,n,ysT),
   0,Inf,n=n,ysT=ysT)$value # 2 check (posterior mean of mu)
# Kernel of the predictive density of ybar under SRSWR
kybarfun=function(ybar,n,N,ysbar) (ybar-(n/N)*ysbar)^(N-n-1) / ybar^N
const = integrate(kybarfun, (n/N)*ysbar, Inf,n=n,N=N,ysbar=ysbar)$value
const # 0.4083333
ybarv=seq( (n/N)*ysbar, (n/N)*ysbar+30, 0.005)
fybarv= kybarfun(ybarv,n=n,N=N,ysbar=ysbar)/const
plot(ybarv,fybarv, type="l",xlim=c(0,20)) # check
(1/const)*integrate(function(ybar,n,N,ysbar) ybar*kybarfun(ybar,n,N,ysbar),
   (n/N)*ysbar,Inf,n=n,N=N,ysbar=ysbar)$value
# 1.714286 (predictive mean of ybar)


# (c)
c=1/ ( 7/ysT^n - 6/(ysT+1)^n ); c # 0.9230769
# Posterior density of lambda: a mixture of two gamma densities
flamfunc=function(lam,n,ysT,c) c*
   ( (7/ysT^n)*dgamma(lam,n,ysT) - (6/(ysT+1)^n)*dgamma(lam,n,ysT+1) )
integrate(flamfunc,0,Inf,n=n,ysT=ysT,c=c)$value # 1 check
lamv=seq(0,20,0.01)
plot(lamv,flamfunc(lamv,n=n,ysT=ysT,c=c),type="l") # OK
# Posterior density of mu = 1/lambda
fmufunc=function(mu,n,ysT,c) c*(1/mu^2)*
   ( (7/ysT^n)*dgamma(1/mu,n,ysT) - (6/(ysT+1)^n)*dgamma(1/mu,n,ysT+1) )

integrate(fmufunc,0,Inf,n=n,ysT=ysT,c=c)$value # 1 check

integrate(function(mu,n,ysT,c) mu*fmufunc(mu,n,ysT,c),
   0,Inf,n=n,ysT=ysT,c=c)$value # 1.384615 (posterior mean of mu)

fmuvc=fmufunc(mu=muv,n=n,ysT=ysT,c=c); plot(muv,fmuvc) # OK


ybarmin=ysT/N; ybarmin # 0.2857143 Minimum possible value of ybar
ybarcut=(ysT+1)/N; ybarcut # 0.4285714 Cut-point for ybar


# f1fun and f2fun are f(ybar|D,lambda) below and above the cut-point
f1fun=function(ybar,lam,n,N,m,ysT) (N / (7-6*exp(-lam))) *
   7*dgamma(N*ybar-ysT,m,lam)
f2fun=function(ybar,lam,n,N,m,ysT) (N / (7-6*exp(-lam))) *
   (7*dgamma(N*ybar-ysT,m,lam)-6*exp(-lam)*dgamma(N*ybar-ysT-1,m,lam) )


# Check for particular values of lambda

lam=0.764 # (example in the range ybarmin to ybarcut)

p1 = integrate(f1fun, ybarmin,ybarcut, lam=lam,n=n,N=N,m=m,ysT=ysT)$value
p2 = integrate(f2fun, ybarcut, Inf, lam=lam,n=n,N=N,m=m,ysT=ysT)$value
c(p1,p2,p1+p2) # 0.001921853 0.998078147 1.000000000 OK


lam=3.214 # (example in the range ybarcut to infinity)

p1 = integrate(f1fun, ybarmin,ybarcut, lam=lam,n=n,N=N,m=m,ysT=ysT)$value
p2 = integrate(f2fun, ybarcut, Inf, lam=lam,n=n,N=N,m=m,ysT=ysT)$value
c(p1,p2,p1+p2) # 0.2298026 0.7701974 1.0000000 OK
# g1fun and g2fun integrate lambda out of f1fun and f2fun
g1fun=function(ybar,n,N,m,ysT,c)
   integrate(function(lam,ybar,n,N,m,ysT,c)
      f1fun(ybar,lam,n,N,m,ysT)*flamfunc(lam,n,ysT,c),
      0,Inf, ybar=ybar, n=n, N=N,m=m,ysT=ysT,c=c)$value
g2fun=function(ybar,n,N,m,ysT,c)
   integrate(function(lam,ybar,n,N,m,ysT,c)
      f2fun(ybar,lam,n,N,m,ysT)*flamfunc(lam,n,ysT,c),
      0,Inf, ybar=ybar, n=n, N=N,m=m,ysT=ysT,c=c)$value


# Check:
g1fun(ybar=0.4,n,N,m,ysT,c) # 0.4119163 OK
g2fun(ybar=0.6,n,N,m,ysT,c) # 1.274185 OK


ybarv1=seq(ybarmin,ybarcut,length.out=400); fybarv1=ybarv1
for(j in 1:length(ybarv1)) fybarv1[j] =
   g1fun(ybar=ybarv1[j],n=n, N=N,m=m,ysT=ysT,c=c)


ybarv2=c( seq(ybarcut,1,length.out=200), seq(1,2,length.out=200),
   seq(2,3,length.out=200), seq(3,5,length.out=200),
   seq(5,10,length.out=200), seq(10,50,length.out=200),
   seq(50,1000,length.out=200), seq(1000,10000,length.out=200) )
fybarv2=ybarv2
for(j in 1:length(ybarv2)) fybarv2[j] =
   g2fun(ybar=ybarv2[j],n=n, N=N,m=m,ysT=ysT,c=c)


plot(c(0,5),c(0,1.5),type="n")
lines(ybarv1, fybarv1,lty=1,lwd=2)
lines(ybarv2, fybarv2,lty=1,lwd=2) # OK


# Check

INTEG <- function(xvec, yvec, a = min(xvec), b = max(xvec)){
   # Integrates numerically under a spline through the points given by
   # the vectors xvec and yvec, from a to b.
   fit <- smooth.spline(xvec, yvec); spline.f <- function(x){predict(fit, x)$y }
   integrate(spline.f, a, b)$value }
INTEG(seq(0,1,0.01),seq(0,1,0.01)^2,0,1) # 0.3333333 check


prob1=INTEG(ybarv1,fybarv1,ybarmin,ybarcut)
prob2=INTEG(ybarv2,fybarv2,ybarcut,10000)

c(prob1,prob2,prob1+prob2) # 0.02880659 0.97119399 1.00000058 OK
INTEG(c(ybarv1,ybarv2),c(fybarv1,fybarv2),ybarmin, 10000) # 1.000004 OK
X11(w=8,h=6); par(mfrow=c(2,1))
plot(ybarv1, ybarv1* fybarv1, xlim=c(0,1)) # OK
plot(ybarv2, ybarv2* fybarv2, xlim=c(0,20)) # OK


term1 = INTEG(ybarv1, ybarv1*fybarv1,ybarmin,ybarcut)
term2 = INTEG(ybarv2, ybarv2*fybarv2,ybarcut, 10000)
ybarhatc = term1 + term2; c(term1, term2, ybarhatc)
# 0.01139601 1.17546200 1.18685801 (predictive mean of ybar)


X11(w=8,h=8); par(mfrow=c(1,1)) # Produce final plots
plot(c(0,5),c(0,1.3),type="n",xlab="mu & ybar",
   ylab="posterior & predictive density")

lines(muv,fmuv,lty=4,lwd=3,col="green") # mu under SRS
lines(muv,fmuvc,lty=2,lwd=3, col="red") # mu under length-biased sampling
lines(ybarv,fybarv, lty=3,lwd=3, col="blue") # ybar under SRS
lines(ybarv1, fybarv1,lty=1,lwd=3); lines(ybarv2, fybarv2,lty=1,lwd=3)
abline(v=(n/N)*ysbar,lty=3); (n/N)*ysbar # 0.2857143
legend(2,1.3,c("f(mu|D) under SRSWR in (a)",
   "f(mu|D) under length-biased sampling in (c)",
   "f(ybar|D) under SRSWR in (a)",
   "f(ybar|D) under length-biased sampling in (c)"),
   lty=c(4,2,3,1),lwd=rep(3,4),col=c("green","red","blue","black"))
text(3.5,0.75,"The dotted vertical line shows the minimum possible")
text(3.5,0.68," value of ybar which is (n*ysbar+0)/N = 0.286")


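Exercise 9.6 A Gibbs sampler for solving a length-biased with-replacement model


Consider the Bayesian model in part (c) of Exercise 9.5, namely:

$$f(I \mid y, \lambda) = \prod_{i=1}^{N} \pi_i^{I_i} (1 - \pi_i)^{1 - I_i}, \quad I_i \in \{0, 1\},$$

$$f(y \mid \lambda) = \prod_{i=1}^{N} \lambda e^{-\lambda y_i}, \quad y_i > 0,$$

$$f(\lambda) \propto 1/\lambda, \quad \lambda > 0,$$

where: $N = 7$, $\pi_i = 0.3 \;\forall\, i = 1, 2, 3, 5, 6, \ldots, N$
$\pi_4 = 0.3$ if $y_4 < 1$ and $\pi_4 = 0.9$ if $y_4 > 1$
$D = (I, y_s) = ((0,0,1,0,1,0,0), (1.6, 0.4))$, $n = 2$, $m = N - n = 5$.


Design and implement a suitable Gibbs sampler so as to obtain a random
sample from the joint distribution of $\mu = 1/\lambda$ and $\bar{y}$. Illustrate your
results with suitable plots and estimates.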
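Solution to Exercise 9.6


Motivated by and using results in the previous exercise (Exercise 9.5),
define $y_0 = y_{rT} - y_4$ and then note that at the observed value of the data,
the Bayesian model implies that:

$f(I \mid y_s, y_0, y_4, \lambda) \propto 7 - 6\, I(y_4 > 1)$

$(y_4 \mid y_s, y_0, \lambda) \sim G(1, \lambda)$

$(y_0 \mid y_s, \lambda) \sim G(m - 1, \lambda)$

$f(y_s \mid \lambda) = \prod_{i \in s} \lambda e^{-\lambda y_i}$

$f(\lambda) \propto 1/\lambda, \; \lambda > 0.$


So

$$f(\lambda, y_0, y_4 \mid D) \propto \frac{1}{\lambda} \times \prod_{i \in s} \lambda e^{-\lambda y_i} \times \lambda^{m-1} y_0^{m-2} e^{-\lambda y_0} \times \lambda e^{-\lambda y_4} \times \big[7 - 6\, I(y_4 > 1)\big].$$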


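We see that a suitable Gibbs sampler is defined by the following three
conditionals:


1. $f(\lambda \mid I, y_s, y_0, y_4) \propto \lambda^{-1} \lambda^{n} \lambda^{m-1} \lambda\, e^{-\lambda(y_{sT} + y_0 + y_4)} = \lambda^{N-1} e^{-\lambda(y_{sT} + y_0 + y_4)}$
   $\Rightarrow (\lambda \mid I, y_s, y_0, y_4) \sim G(N,\, y_{sT} + y_0 + y_4)$

2. $f(y_0 \mid I, y_s, \lambda, y_4) \propto y_0^{m-2} e^{-\lambda y_0}$
   $\Rightarrow (y_0 \mid I, y_s, \lambda, y_4) \sim G(m - 1, \lambda)$

3. $f(y_4 \mid I, y_s, \lambda, y_0) \propto [7 - 6\, I(y_4 > 1)]\, \lambda e^{-\lambda y_4}, \; y_4 > 0.$


The first two of these three conditionals are straightforward and easy to
sample from. The third conditional can be sampled from via the inversion
technique as follows.


First, for notational convenience, write the relevant random variable as $x$
with density
$$f(x) \propto [7 - 6\, I(x > 1)]\, \lambda e^{-\lambda x}, \quad x > 0.$$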
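Then the cdf of $x$ is

$$F(x) = r \times \begin{cases} 7 \int_0^x \lambda e^{-\lambda t}\, dt, & 0 < x < 1 \\[4pt] 7 \int_0^x \lambda e^{-\lambda t}\, dt - 6 \int_1^x \lambda e^{-\lambda t}\, dt, & x > 1 \end{cases}$$
for some constant $r$
$$= r \times \begin{cases} 7(1 - e^{-\lambda x}), & 0 < x < 1 \\ 7(1 - e^{-\lambda x}) - 6(e^{-\lambda} - e^{-\lambda x}), & x > 1, \end{cases}$$

which equals $r(7 - 0 - 6e^{-\lambda})$ in the limit as $x \to \infty$; so $r = 1/(7 - 6e^{-\lambda})$.


Now observe that
$$F(x = 1) = \frac{7 - 7e^{-\lambda}}{7 - 6e^{-\lambda}}.$$

This is a constant in the formula for the quantile function of $x$, obtained
as follows.


First, if $p < \dfrac{7 - 7e^{-\lambda}}{7 - 6e^{-\lambda}}$ then we solve $p = r(7 - 7e^{-\lambda x})$

and thereby obtain $x = -\dfrac{1}{\lambda} \log\left(1 - \dfrac{p}{7r}\right)$.


Secondly, if $p > \dfrac{7 - 7e^{-\lambda}}{7 - 6e^{-\lambda}}$ then we solve $p = r(7 - e^{-\lambda x} - 6e^{-\lambda})$

and thereby obtain $x = -\dfrac{1}{\lambda} \log\left(7 - 6e^{-\lambda} - \dfrac{p}{r}\right)$.


In summary, the quantile function of $x$ is given by

$$Q(p) = \begin{cases} -\dfrac{1}{\lambda} \log\left(1 - \dfrac{p}{7}(7 - 6e^{-\lambda})\right), & p \le \dfrac{7 - 7e^{-\lambda}}{7 - 6e^{-\lambda}} \\[10pt] -\dfrac{1}{\lambda} \log\left(7 - 6e^{-\lambda} - p(7 - 6e^{-\lambda})\right), & p > \dfrac{7 - 7e^{-\lambda}}{7 - 6e^{-\lambda}}. \end{cases} \qquad (9.2)$$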
So a procedure for sampling from the third conditional in the Gibbs
sampler, namely
$$f(y_4 \mid I, y_s, \lambda, y_0) \propto [7 - 6\, I(y_4 > 1)]\, \lambda e^{-\lambda y_4},$$
is to draw $u \sim U(0,1)$ and then return $y_4 = Q(u)$ as per equation (9.2).
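
The inversion step can be sanity-checked numerically. The following is a minimal sketch only: the function names `Fx` and `Qx` and the test value `lam = 1.3` are illustrative choices, not part of the original solution. It codes the cdf and the quantile function (9.2) in vectorised form, confirms that F(Q(p)) = p on a grid, and compares the simulated proportion of draws below 1 with the exact value F(1).

```r
# Sketch: cdf and quantile function for the density
# f(x) proportional to [7 - 6*I(x > 1)] * lam * exp(-lam * x), x > 0.
# Fx and Qx are illustrative names only.
Fx <- function(x, lam) {
  r <- 1 / (7 - 6 * exp(-lam))
  ifelse(x < 1,
         r * (7 - 7 * exp(-lam * x)),
         r * (7 - exp(-lam * x) - 6 * exp(-lam)))
}
Qx <- function(p, lam) {
  r  <- 1 / (7 - 6 * exp(-lam))
  c1 <- r * (7 - 7 * exp(-lam))            # F(1), the cut-point for p
  inner <- ifelse(p <= c1,
                  1 - p / (7 * r),
                  7 - 6 * exp(-lam) - p / r)
  -log(inner) / lam
}

lam <- 1.3
p <- seq(0.01, 0.99, by = 0.01)
err <- max(abs(Fx(Qx(p, lam), lam) - p))   # should be numerically zero
err

set.seed(1)
x <- Qx(runif(1e5), lam)                   # inversion sampling
c(empirical = mean(x <= 1), exact = Fx(1, lam))
```

A vectorised version like this also lets many values be drawn in one call, although inside the Gibbs sampler only a single draw per iteration is needed.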


Implementing the above Gibbs sampler for 20,000 iterations following a
burn-in of 1,000 and then thinning out by a factor of 10 we obtained a
random sample of size J = 2,000 from the joint posterior/predictive
distribution, $f(\lambda, y_0, y_4 \mid I, y_s)$.


Figure 9.4 displays trace plots for the three unknowns, $\lambda$, $y_0$, $y_4$, sample
ACFs for these over the last 20,000 iterations, and the three sample ACFs
again over the final samples of size J. Figure 9.5 is a histogram of the J
simulated values of $\mu = 1/\lambda$ and Figure 9.6 is a histogram of the J
simulated values of $\bar{y} = (y_{sT} + y_0 + y_4)/N$. In each histogram are shown
a density estimate as well as three vertical lines for the Monte Carlo point
estimate and 95% CI for the mean.


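The posterior density of $\mu$, i.e. $f(\mu \mid D)$, was estimated via Rao-Blackwell as

$$\hat{f}(\mu \mid D) = \frac{1}{J} \sum_{j=1}^{J} f_{IG(N,\, N\bar{y}^{(j)})}(\mu),$$

where
$$\bar{y}^{(j)} = (y_{sT} + y_0^{(j)} + y_4^{(j)})/N,$$
using the fact that
$$(\mu \mid I, y_s, y_0, y_4) \sim IG(N,\, y_{sT} + y_0 + y_4) \sim IG(N, y_T) \sim IG(N, N\bar{y}).$$


The posterior mean of $\mu$, i.e. $E(\mu \mid D)$, was also estimated via Rao-Blackwell as

$$\hat{\mu} = \frac{1}{J} \sum_{j=1}^{J} \frac{N\bar{y}^{(j)}}{N - 1} = 1.41,$$

using the fact that
$$E(\mu \mid I, y_s, y_0, y_4) = (y_{sT} + y_0 + y_4)/(N - 1),$$
with 95% CI for the posterior mean equal to

$$\hat{\mu} \pm 1.96 \sqrt{\frac{1}{J(J-1)} \sum_{j=1}^{J} \left( \frac{N\bar{y}^{(j)}}{N - 1} - \hat{\mu} \right)^2} = (1.34, 1.47).$$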




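Note: This is consistent with the exact value, namely $E(\mu \mid D) = 1.38$,
as obtained in Exercise 9.5.


The predictive density of $\bar{y}$ was estimated by smoothing a probability
histogram of the simulated values $\bar{y}^{(j)}$, and the predictive mean of $\bar{y}$, i.e.
$E(\bar{y} \mid D)$, was estimated by

$$\hat{\bar{y}} = \frac{1}{J} \sum_{j=1}^{J} \bar{y}^{(j)} = 1.20,$$

with 95% CI

$$\hat{\bar{y}} \pm 1.96 \sqrt{\frac{1}{J(J-1)} \sum_{j=1}^{J} \big( \bar{y}^{(j)} - \hat{\bar{y}} \big)^2} = (1.15, 1.26).$$


Note 1: This is consistent with the exact value, $E(\bar{y} \mid D) = 1.19$, as
obtained in Exercise 9.5.


Note 2: We may be able to improve on the above 'histogram' estimation
of $E(\bar{y} \mid D)$ using Rao-Blackwell methods. For example, observe that

$$E(\bar{y} \mid D, \lambda, y_4) = \frac{1}{N} \left( y_{sT} + y_4 + \frac{m - 1}{\lambda} \right).$$

So we define

$$e_j = E(\bar{y} \mid D, \lambda_j, y_4^{(j)}) = \frac{1}{N} \left( y_{sT} + y_4^{(j)} + \frac{m - 1}{\lambda_j} \right).$$

The associated Rao-Blackwell estimate of $E(\bar{y} \mid D)$ is

$$\bar{e} = \frac{1}{J} \sum_{j=1}^{J} e_j = 1.21,$$

with 95% CI

$$\bar{e} \pm 1.96 \sqrt{\frac{1}{J(J-1)} \sum_{j=1}^{J} (e_j - \bar{e})^2} = (1.16, 1.26).$$


Note 3: In this case, applying Rao-Blackwell methods has only slightly
narrowed the CI for $E(\bar{y} \mid D)$.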
Figure 9.4 Trace plots and sample ACFs

Figure 9.5 Inference on the superpopulation mean

R Code for Exercise 9.6


# Quantile function Q(p) of equation (9.2), for a single value of p
Qfun = function(p=0.5,lam=1){
   c1 = (7-7*exp(-lam))/(7-6*exp(-lam))
   if(p <= c1) c2 = 1 - (p/7) * (7-6*exp(-lam))
   if(p > c1) c2 = 7 - 6*exp(-lam) - p*(7-6*exp(-lam))
   -(1/lam)*log(c2) }


# Check:

pvec=seq(0,1,0.001); Qvec=pvec

for(i in 1:length(pvec)) Qvec[i] = Qfun(p=pvec[i],lam=1.3)
plot(pvec,Qvec); plot(Qvec,pvec) # OK


# Gibbs sampler cycling through the three conditionals for lam, y0 and y4
GS = function(J=1000,N=7,n=2,m=5, ysT=2, lam=1,y0=1,y4=1){
   lamv=lam; y0v=y0; y4v=y4; for(j in 1:J){
      lam=rgamma(1,N,ysT+y0+y4)
      y0=rgamma(1,m-1,lam)
      u=runif(1); y4=Qfun(p=u,lam=lam)
      lamv=c(lamv,lam); y0v=c(y0v,y0); y4v=c(y4v,y4) }
   list(lamv=lamv, y0v=y0v, y4v=y4v) }


X11(w=8,h=9); par(mfrow=c(3,3)); set.seed(321); date()
res= GS(J=21000,N=7,n=2,m=5, ysT=2, lam=1,y0=1,y4=1); date() # took 3 secs
plot(res$lamv,type="l"); plot(res$y0v,type="l"); plot(res$y4v,type="l") # OK


# Discard the starting values and burn-in, then thin by a factor of 10
lamv=res$lamv[-(1:1001)]; y0v=res$y0v[-(1:1001)]; y4v=res$y4v[-(1:1001)];
acf(lamv); acf(y0v); acf(y4v) # high serial correlation, so need to thin out
inc= seq(10,20000,10); lamvec=lamv[inc]; y0vec=y0v[inc]; y4vec=y4v[inc];
acf(lamvec); acf(y0vec); acf(y4vec) # OK

J = length(lamvec); J # 2000


N=7;n=2;m=5; ysT=2; muvec=1/lamvec; ybarvec=(1/N)*(ysT+y0vec+y4vec)
ybarhat=mean(ybarvec);
ybarci=ybarhat+c(-1,1)*qnorm(0.975)*sd(ybarvec)/sqrt(J)

c(ybarhat, ybarci, ybarci[2]-ybarci[1]) # 1.204519 1.151619 1.257419 0.105800


# Rao-Blackwell estimate of the predictive mean of ybar
evec=(1/N)*( ysT+ y4vec + (m-1)/lamvec )
ebar=mean(evec); eci= ebar+c(-1,1)*qnorm(0.975)*sd(evec)/sqrt(J)
c(ebar,eci,eci[2]-eci[1]) # 1.2091236 1.1581903 1.2600569 0.1018666


muhat=(N/(N-1))*ybarhat


muci=muhat + c(-1,1)*qnorm(0.975)*sd( (N/(N-1))*ybarvec ) / sqrt(J)
c(muhat, muci) # 1.405272 1.343556 1.466989
# Rao-Blackwell estimate of the posterior density of mu
mugrid=seq(0.001,10.001,0.01)
fmuhat=mugrid; for(i in 1:length(mugrid))
   fmuhat[i] = mean( dgamma(1/mugrid[i], N, N*ybarvec )/mugrid[i]^2 )


X11(w=8,h=5)


hist(muvec,prob=T, xlim=c(0,5),ylim=c(0,1),breaks=seq(0,80,0.1),
   xlab="mu", main="")
lines(mugrid,fmuhat,lwd=2); abline(v= c(muhat, muci), lwd=2)


hist(ybarvec, prob=T, xlim=c(0,5),ylim=c(0,1.2),breaks=seq(0,80,0.1),
   xlab="ybar", main=" ")
lines(density(ybarvec),lwd=2); abline(v= c(ybarhat, ybarci), lwd=2)


Exercise 9.7 Gibbs sampler for a length-biased without-replacement sampling model


Earlier we defined $L = (L_1, \ldots, L_n)$ as the vector of the labels of the selected
units in the order in which they are sampled.


Now consider the following Bayesian finite population model:

$$f(L \mid y, \lambda) = \prod_{i=1}^{n} \frac{y_{L_i}}{y_T - \sum_{j=1}^{i-1} y_{L_j}}, \quad L = (a_1, \ldots, a_n),$$
$$a_i \in \{1, \ldots, N\} \;\forall\, i \in \{1, \ldots, n\} \;\&\; a_i \ne a_j \;\forall\, i \ne j \in \{1, \ldots, n\}$$

$$f(y \mid \lambda) = \prod_{i=1}^{N} \lambda e^{-\lambda y_i}, \quad y_i > 0 \;\forall\, i$$

$$f(\lambda) \propto 1/\lambda, \quad \lambda > 0.$$


Design and implement a suitable Gibbs sampler so as to obtain a random
sample from the joint distribution of $\mu = 1/\lambda$ and $\bar{y}$ in the case where
$N = 7$, $n = 3$, $m = N - n = 4$
and when the observed data is
$D = (L, y_s) = ((4,3,6), (1.6, 0.4, 0.7))$.


Illustrate your results with suitable plots and estimates.
The sampling mechanism here is defined by the model density of L, which
may also be written as

$$f(L \mid y, \lambda) = \frac{y_{L_1}}{y_T} \times \frac{y_{L_2}}{y_T - y_{L_1}} \times \cdots \times \frac{y_{L_n}}{y_T - y_{L_1} - \cdots - y_{L_{n-1}}}$$

for $L = (1, \ldots, n), (2, 1, 3, \ldots, n), \ldots, (N, N-1, \ldots, N-n+1)$.


This pdf implies that units are selected from the finite population, one by
one and without replacement, in such a way that the probability of
selecting a unit on any given draw is its value divided by the sum of the
values of all units which have not yet been sampled at that point in time.
We call this procedure length-biased sampling without replacement.
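
To make the mechanism concrete, here is a minimal sketch under stated assumptions: the function name `fL` and the toy population `y` are hypothetical and are not the data of this exercise. It evaluates the density of an ordered label vector, checks that the probabilities of all ordered samples of size 2 sum to 1, and compares first-draw frequencies from base R's `sample()` (which applies `prob` weights sequentially among the remaining items when sampling without replacement) with the exact values.

```r
# Sketch of length-biased sampling without replacement.
# fL(L, y) evaluates the model density of the ordered label vector L.
fL <- function(L, y) {
  p <- 1
  remaining <- sum(y)                      # total of units not yet sampled
  for (i in seq_along(L)) {
    p <- p * y[L[i]] / remaining
    remaining <- remaining - y[L[i]]
  }
  p
}

y <- c(1.6, 0.4, 0.7, 2.0)                 # hypothetical toy population, N = 4

# Enumerate all ordered samples of size n = 2 without replacement
Ls <- expand.grid(a1 = 1:4, a2 = 1:4)
Ls <- Ls[Ls$a1 != Ls$a2, ]
probs <- apply(Ls, 1, function(L) fL(L, y))
sum(probs)                                 # the probabilities should sum to 1

# Simulate the mechanism and compare first-draw frequencies with y / sum(y)
set.seed(1)
first <- replicate(20000, sample(4, 2, prob = y)[1])
rbind(empirical = as.numeric(table(first)) / 20000, exact = y / sum(y))
```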


Note: This is an example of a sampling mechanism that is nonignorable
but known. If $f(L \mid y, \lambda)$ depended on $\lambda$, or on some other unknown
quantity, then we would say that the sampling mechanism is
nonignorable and unknown.


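In the present case it is convenient to relabel the population units, after
sampling, in such a way that $L = (1, 2, \ldots, n)$ and so also $s = (1, \ldots, n)$ and
$r = (n+1, \ldots, N)$. Assuming that this is done, we may write the density of
the sampling mechanism in various other and simpler ways, for example:

$$f(L \mid y, \lambda) = \frac{y_1}{y_T} \times \frac{y_2}{y_T - y_1} \times \cdots \times \frac{y_n}{y_T - y_1 - \cdots - y_{n-1}}$$

$$= \frac{y_1}{y_1 + \cdots + y_N} \times \frac{y_2}{y_2 + \cdots + y_N} \times \cdots \times \frac{y_n}{y_n + \cdots + y_N}$$

$$= \prod_{i=1}^{n} \frac{y_i}{y_i + \cdots + y_N} = \prod_{i=1}^{n} \frac{y_i}{y_i + \cdots + y_n + y_{rT}}.$$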


Note: We have not previously relabelled population units in this manner 
because doing so would have provided only marginal notational 
convenience and may have obscured the nature of the sampling 
mechanisms we were trying to illustrate. In the next chapter, we will 
again make use of a convenient relabelling scheme similar to the one 
applied here. 


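With the above relabelling in place, and noting that
$$(y_{rT} \mid \lambda) \sim G(m, \lambda),$$
the joint posterior density of $\lambda$ and $y_{rT}$ (given the data, $D = (L, y_s)$) may
now be written as

$$f(\lambda, y_{rT} \mid D) \propto f(\lambda, y_{rT}, y_s, L) = f(\lambda)\, f(y_s \mid \lambda)\, f(y_{rT} \mid \lambda)\, f(L \mid y_s, y_{rT})$$

$$\propto \frac{1}{\lambda} \times \left( \prod_{i=1}^{n} \lambda e^{-\lambda y_i} \right) \times \lambda^m y_{rT}^{m-1} e^{-\lambda y_{rT}} \times \prod_{i=1}^{n} \frac{1}{y_i + \cdots + y_n + y_{rT}}.$$


This joint density suggests a Metropolis-Hastings algorithm with a Gibbs
step defined by the conditional posterior distribution
$$(\lambda \mid D, y_{rT}) \sim G(N,\, y_{sT} + y_{rT})$$
and a Metropolis step defined by a rather complicated conditional
predictive density defined by

$$f(y_{rT} \mid D, \lambda) \propto y_{rT}^{m-1} e^{-\lambda y_{rT}} \prod_{i=1}^{n} \frac{1}{y_i + \cdots + y_n + y_{rT}}.$$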
i i Dreams rem 


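At this point it is useful to recall a data augmentation technique based on
the identity

$$1 = \int_0^\infty x e^{-xw}\, dw,$$
or equivalently
$$\frac{1}{x} = \int_0^\infty e^{-xw}\, dw,$$

which can be applied here so as to yield the identity

$$\prod_{i=1}^{n} \frac{1}{y_i + \cdots + y_n + y_{rT}} = \prod_{i=1}^{n} \int_0^\infty e^{-w_i (y_i + \cdots + y_n + y_{rT})}\, dw_i.$$

This suggests that we introduce an artificial or latent random variable
$w = (w_1, \ldots, w_n)$ into our model which is defined in such a way that the
joint posterior density of $\lambda$, $y_{rT}$ and $w$ is given by

$$f(\lambda, y_{rT}, w \mid D) \propto \frac{1}{\lambda} \times \left( \prod_{i=1}^{n} \lambda e^{-\lambda y_i} \right) \times \lambda^m y_{rT}^{m-1} e^{-\lambda y_{rT}} \times \prod_{i=1}^{n} e^{-w_i (y_i + \cdots + y_n + y_{rT})}.$$


Note: If we integrate this joint density with respect to $w$ then we recover
$f(\lambda, y_{rT} \mid D)$ as above.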
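
The identity behind the augmentation is easy to confirm numerically. In the sketch below, the value `yrT = 1.2` is an arbitrary test input (an assumption for illustration), and `tails`, `lhs` and `rhs` are illustrative names; the sample values are those of this exercise.

```r
# Numerical check of 1/x = integral_0^Inf exp(-x*w) dw, applied factor by
# factor to the product over i = 1..n.
ys  <- c(1.6, 0.4, 0.7)
yrT <- 1.2                                  # arbitrary test value
n   <- length(ys)

# tails[i] = y_i + ... + y_n + y_rT
tails <- sapply(1:n, function(i) sum(ys[i:n]) + yrT)

lhs <- prod(1 / tails)
rhs <- prod(sapply(tails, function(x)
  integrate(function(w) exp(-x * w), 0, Inf)$value))
c(lhs = lhs, rhs = rhs)                     # the two should agree
```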
The above expression for $f(\lambda, y_{rT}, w \mid D)$ now suggests a 'pure' Gibbs
sampler defined by the following $n + 2$ conditional distributions:


$(\lambda \mid D, y_{rT}, w) \sim G(N,\, y_{sT} + y_{rT})$
$(y_{rT} \mid D, \lambda, w) \sim G(m,\, \lambda + w_T)$ where $w_T = w_1 + \cdots + w_n$
$(w_i \mid D, \lambda, y_{rT}) \sim \perp G(1,\, y_i + \cdots + y_n + y_{rT}), \; i = 1, \ldots, n.$

This Gibbs sampler can be used to generate a random sample
$$(\lambda_j, y_{rT}^{(j)}, w^{(j)}) \sim \text{iid } f(\lambda, y_{rT}, w \mid D), \quad j = 1, \ldots, J,$$
where
$$w^{(j)} = (w_1^{(j)}, \ldots, w_n^{(j)}).$$
This sample can then be used for Monte Carlo inference on the quantities
of interest, namely $\mu = 1/\lambda$ and $\bar{y} = (y_{sT} + y_{rT})/N$.


Applying the above Gibbs sampler (with a suitable burn-in and thinning)
we obtained a random sample of size J = 2,000 from the joint posterior
distribution of $\lambda$, $y_{rT}$ and $w = (w_1, \ldots, w_n)$.


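The posterior density of $\mu$ was estimated via Rao-Blackwell as

$$\hat{f}(\mu \mid D) = \frac{1}{J} \sum_{j=1}^{J} f_{IG(N,\, N\bar{y}^{(j)})}(\mu),$$

where
$$\bar{y}^{(j)} = (y_{sT} + y_{rT}^{(j)})/N,$$
using the fact that
$$(\mu \mid L, y_s, y_{rT}, w) \sim IG(N,\, y_{sT} + y_{rT}) \sim IG(N, y_T) \sim IG(N, N\bar{y}).$$


The posterior mean of $\mu$ was also estimated via Rao-Blackwell as
$$\hat{\mu} = \frac{1}{J} \sum_{j=1}^{J} \frac{y_{sT} + y_{rT}^{(j)}}{N - 1} = \frac{1}{J} \sum_{j=1}^{J} \frac{N\bar{y}^{(j)}}{N - 1} = 0.619,$$
using the fact that
$$E(\mu \mid L, y_s, y_{rT}, w) = (y_{sT} + y_{rT})/(N - 1),$$
with 95% CI

$$\hat{\mu} \pm 1.96 \sqrt{\frac{1}{J(J-1)} \sum_{j=1}^{J} \left( \frac{N\bar{y}^{(j)}}{N - 1} - \hat{\mu} \right)^2} = (0.614, 0.624).$$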
The predictive density of $y_{rT}$ was likewise estimated via Rao-Blackwell
as

$$\hat{f}(y_{rT} \mid D) = \frac{1}{J} \sum_{j=1}^{J} f_{G(m,\, \lambda_j + w_T^{(j)})}(y_{rT}),$$

where
$$w_T^{(j)} = w_1^{(j)} + \cdots + w_n^{(j)}.$$


The predictive mean of $y_{rT}$ was also estimated via Rao-Blackwell as

$$\hat{y}_{rT} = \frac{1}{J} \sum_{j=1}^{J} \frac{m}{\lambda_j + w_T^{(j)}} = 1.013,$$

using the fact that
$$E(y_{rT} \mid L, y_s, \lambda, w) = m/(\lambda + w_T),$$

with 95% CI

$$\hat{y}_{rT} \pm 1.96 \sqrt{\frac{1}{J(J-1)} \sum_{j=1}^{J} \left( \frac{m}{\lambda_j + w_T^{(j)}} - \hat{y}_{rT} \right)^2} = (0.993, 1.033).$$


These Rao-Blackwell estimates for $y_{rT}$ were then transformed into
estimates for $\bar{y}$ via the equation

$$\bar{y} = (y_{sT} + y_{rT})/N.$$


In this way, we estimated $\bar{y}$'s posterior mean by 0.530, with 95% CI
(0.528, 0.533), in line with the output of the R code for this exercise.


Figure 9.7 shows trace plots for $\lambda$, $y_{rT}$ and $w_1$, sample ACFs for these
quantities over the last 10,000 iterations, and these three sample ACFs
again but calculated using only the final smaller samples of size J = 2,000.


Figures 9.8 and 9.9 show two histograms, of the J simulated
values of $\mu = 1/\lambda$, and of the J simulated values of $\bar{y} = (y_{sT} + y_{rT})/N$.
In each histogram are shown a density estimate and three vertical lines
representing the Monte Carlo point estimate and 95% CI for the posterior
mean.


Note 1: The type of sampling mechanism which features in this exercise 
has applications in the analysis of oil discovery data. For further details, 
see West (1996). 
Note 2: In this chapter, we have presented several examples of how
Bayesian methods can be used to perform inference on an exponential
finite population under biased sampling. For another such example, see
Puza and O'Neill (2005).


Figure 9.7 Trace plots and sample ACFs for samples obtained via MCMC

Figure 9.8 Inference on the superpopulation mean via MCMC

Figure 9.9 Inference on the finite population mean via MCMC
R Code for Exercise 9.7


GS = function(J=1000,N=7,n=3,m=4, ys=c(1.6,0.4,0.7),
      lam=1,yrT=1,w=rep(1,3)){
   ysT=sum(ys); lamv=lam; yrTv=yrT; wmat=w; for(j in 1:J){
      lam=rgamma(1,N,ysT+yrT);
      yrT=rgamma(1,m,lam+sum(w))
      # Note: the conditional for w_i stated in the text has rate
      # y_i+...+y_n+y_rT, i.e. sum(ys[i:n])+yrT; the line below is as printed
      for(i in 1:n) w[i] = rgamma(1,1,sum(ys[i:n]))
      lamv=c(lamv,lam); yrTv=c(yrTv,yrT); wmat=rbind(wmat,w)
   }
   list(lamv=lamv, yrTv=yrTv, wmat=wmat)
}


set.seed(321); date()
res=GS(J=11000,N=7,n=3,m=4, ys=c(1.6,0.4,0.7), lam=1,yrT=1,w=rep(1,3))
date() # took 4 secs


X11(w=8,h=9); par(mfrow=c(3,3));

plot(res$lamv,type="l"); plot(res$yrTv,type="l"); plot(res$wmat[,1],type="l")

# Discard the starting values and burn-in, then thin by a factor of 5
lamv=res$lamv[-(1:1001)]; yrTv=res$yrTv[-(1:1001)];
wmat=res$wmat[-(1:1001),]
acf(lamv); acf(yrTv); acf(wmat[,1])

inc= seq(5,10000,5); lamvec=lamv[inc]; yrTvec=yrTv[inc]; wmatrix=wmat[inc,];
acf(lamvec); acf(yrTvec); acf(wmatrix[,1]) # OK
J = length(lamvec); J # 2000


N=7;n=3;m=4; ys=c(1.6,0.4,0.7); ysT=sum(ys);
muvec=1/lamvec; ybarvec=(1/N)*(ysT+yrTvec)
wTvec=apply(wmatrix,1,sum)

# Rao-Blackwell estimate of the predictive mean of yrT, with 95% CI
yrThat=mean(m/(lamvec+wTvec))
yrTci=yrThat+c(-1,1)*qnorm(0.975)*sd(m/(lamvec+wTvec))/sqrt(J)
c(yrThat,yrTci) # 1.0131279 0.9930648 1.0331911
ybarhat=(1/N)*(ysT+yrThat)
ybarci=(1/N)*(ysT+yrTci)
c(ybarhat,ybarci) # 0.5304468 0.5275807 0.5333130


muhat=(N/(N-1))*ybarhat
muci=muhat + c(-1,1)*qnorm(0.975)*sd( (N/(N-1))*ybarvec ) / sqrt(J)
c(muhat, muci) # 0.6188547 0.6136692 0.6240401
# Rao-Blackwell estimates of the densities of mu and ybar
mugrid=seq(0.001,10.001,0.01)
fmuhat=mugrid; for(i in 1:length(mugrid))
   fmuhat[i] = mean( dgamma(1/mugrid[i], N, N*ybarvec )/mugrid[i]^2 )


ybargrid=seq(0,10,0.01)
fybarhat= ybargrid; for(i in 1:length(ybargrid))
   fybarhat[i] = mean( dgamma(N*ybargrid[i]-ysT, m, lamvec+wTvec )*N )


X11(w=8,h=5); par(mfrow=c(1,1))

hist(muvec,prob=T, xlim=c(0,3),ylim=c(0,2.5),breaks=seq(0,80,0.1),
   xlab="mu", main="")
lines(mugrid,fmuhat,lwd=2); abline(v= c(muhat, muci), lwd=2)

hist(ybarvec,prob=T, xlim=c(0.3,1.2), ylim=c(0,7),breaks=seq(0,80,0.025),
   xlab="ybar", main="")
lines(ybargrid, fybarhat,lwd=2); abline(v= c(ybarhat, ybarci), lwd=2)
10.1 The basic normal-normal finite population model


Consider a finite population of N values $y_1, \ldots, y_N$ from the normal
distribution with unknown mean $\mu$ and known variance $\sigma^2$. Assume
we have prior information about $\mu$ which may be expressed in terms of
a normal distribution with mean $\mu_0$ and variance $\sigma_0^2$.


Suppose that we are interested in the finite population mean, namely
$\bar{y} = (y_1 + \cdots + y_N)/N$, and wish to perform inference on $\bar{y}$ based on the
observed values in a sample of size n taken from this finite population
via simple random sampling without replacement (SRSWOR).


For convenience, we will in what follows label (or rather relabel) the
n sample units as $1, \ldots, n$ and the $m = N - n$ nonsample units as
$n+1, \ldots, N$. This convention simplifies notation and allows us to write
the finite population vector, originally defined by $y = (y_1, \ldots, y_N)$, as
$y = ((y_1, \ldots, y_n), (y_{n+1}, \ldots, y_N)) = (y_s, y_r)$.


Example: Suppose that we sample units 2, 3 and 5 from a finite
population of size 7. Then we change the labels of units 2, 3 and 5 to 1,
2 and 3, respectively, and we change the labels of units 1, 4, 6 and 7 to
4, 5, 6 and 7, respectively.


Thereby, instead of writing $y_s = (y_2, y_3, y_5)$ and $y_r = (y_1, y_4, y_6, y_7)$,
we may write $y_s = (y_1, y_2, y_3)$ and $y_r = (y_4, y_5, y_6, y_7)$, respectively.
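
The relabelling in this example amounts to a reordering of the population vector. A minimal sketch follows; the population values in `y` are made up purely for illustration.

```r
# Sketch of the relabelling convention: sampled units first, then the rest.
N <- 7
s <- c(2, 3, 5)                            # original labels of the sampled units
r <- setdiff(1:N, s)                       # original labels of the nonsampled units
y <- c(11, 12, 13, 14, 15, 16, 17)         # hypothetical population values
y_relab <- y[c(s, r)]                      # relabelled population vector (y_s, y_r)
ys <- y_relab[1:length(s)]                 # values of the units now labelled 1, 2, 3
yr <- y_relab[(length(s) + 1):N]           # values of the units now labelled 4, ..., 7
```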


We will also implicitly condition on $s = (s_1, \ldots, s_n)$ at its fixed value and
suppress $s$ from much of the notation. Thus we will sometimes write
$f(\bar{y} \mid s, y_s)$ as $f(\bar{y} \mid y_s)$, with an understanding that $s$ refers to the
particular units which were actually sampled.
Our inferential problem may be thought of as prediction of $\bar{y}_r$ given the
data, $y_s$ (and $s$), since $\bar{y} = (n\bar{y}_s + m\bar{y}_r)/N$. Considering the various
distributions that are involved, a suitable Bayesian model is:
$(\bar{y}_r \mid y_s, \mu) \sim N(\mu, \sigma^2/m)$
(the model distribution of the nonsample mean)
$(y_1, \ldots, y_n \mid \mu) \sim \text{iid } N(\mu, \sigma^2)$
(the model distribution of the sample values)
$\mu \sim N(\mu_0, \sigma_0^2)$ (the prior distribution).


This model will be called the basic normal-normal finite population
model. By results for the normal-normal model reported earlier, we see
that the posterior distribution of the superpopulation mean is given by

$$(\mu \mid y_s) \sim N(\mu_*, \sigma_*^2),$$

where: $\mu_* = (1 - k)\mu_0 + k\bar{y}_s$ (the posterior mean as a credibility estimate)

$\sigma_*^2 = k\, \dfrac{\sigma^2}{n}$ (the posterior variance), $\quad k = \dfrac{n}{n + \sigma^2/\sigma_0^2}$
(the credibility factor and weight given to the MLE, $\bar{y}_s$).


It will be recalled that in this context the predictive density of the
nonsample mean is

$$f(\bar{y}_r \mid y_s) = \int f(\bar{y}_r \mid y_s, \mu)\, f(\mu \mid y_s)\, d\mu.$$

But this is the integral of the exponential of a quadratic function of $\mu$ and
$\bar{y}_r$, and so equals the exponential of a quadratic function of $\bar{y}_r$. It follows
that

$$(\bar{y}_r \mid y_s) \sim N(a, b^2),$$
where: $a = E(\bar{y}_r \mid y_s) = E\{E(\bar{y}_r \mid y_s, \mu) \mid y_s\} = E\{\mu \mid y_s\} = \mu_*$

$b^2 = V(\bar{y}_r \mid y_s) = V\{E(\bar{y}_r \mid y_s, \mu) \mid y_s\} + E\{V(\bar{y}_r \mid y_s, \mu) \mid y_s\}$

$\quad = V\{\mu \mid y_s\} + E\left\{ \dfrac{\sigma^2}{m} \,\Big|\, y_s \right\} = \sigma_*^2 + \dfrac{\sigma^2}{m}.$


It follows that $(\bar{y} \mid y_s) \sim N(c, d^2)$, where:

Chapter 10: Normal Finite Population Models 


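$$c = E(\bar{y} \mid y_s) = E\left( \frac{n\bar{y}_s + m\bar{y}_r}{N} \,\Big|\, y_s \right) = \frac{n\bar{y}_s + mE(\bar{y}_r \mid y_s)}{N} = \frac{n\bar{y}_s + m\mu_*}{N}$$

$$d^2 = V(\bar{y} \mid y_s) = V\left( \frac{n\bar{y}_s + m\bar{y}_r}{N} \,\Big|\, y_s \right) = \frac{m^2}{N^2}\, V(\bar{y}_r \mid y_s) = \frac{m^2}{N^2} \left( \sigma_*^2 + \frac{\sigma^2}{m} \right).$$


Then, the $1 - \alpha$ central predictive density region (CPDR) for $\bar{y}$ is given
by $(c \pm z_{\alpha/2}\, d)$.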

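Summary: For the basic normal-normal finite population model:

$(\bar{y}_r \mid y_s, \mu) \sim N(\mu, \sigma^2/m)$
$(y_1, \ldots, y_n \mid \mu) \sim \text{iid } N(\mu, \sigma^2)$
$\mu \sim N(\mu_0, \sigma_0^2)$,

the posterior distribution of the superpopulation mean $\mu$ is given by
$$(\mu \mid y_s) \sim N(\mu_*, \sigma_*^2),$$
where: $\mu_* = (1 - k)\mu_0 + k\bar{y}_s$, $\quad \sigma_*^2 = k\, \dfrac{\sigma^2}{n}$, $\quad k = \dfrac{n}{n + \sigma^2/\sigma_0^2}$.


The predictive distribution of the nonsample mean $\bar{y}_r$ is given by
$$(\bar{y}_r \mid y_s) \sim N(a, b^2),$$
where: $a = \mu_*$, $\quad b^2 = \sigma_*^2 + \dfrac{\sigma^2}{m}$, $\quad m = N - n$.


The $1 - \alpha$ CPDR for $\bar{y}_r$ is $(a \pm z_{\alpha/2}\, b)$.


The predictive distribution of the finite population mean $\bar{y}$ is given by
$$(\bar{y} \mid y_s) \sim N(c, d^2),$$
where: $c = \dfrac{n\bar{y}_s + m\mu_*}{N}$, $\quad d^2 = \dfrac{m^2}{N^2} \left( \sigma_*^2 + \dfrac{\sigma^2}{m} \right)$ (with $\mu_*$ and $\sigma_*^2$ as above).


The $1 - \alpha$ CPDR for $\bar{y}$ is $(c \pm z_{\alpha/2}\, d)$.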
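
The summary formulas translate directly into a few lines of R. The sketch below is illustrative only: the function name `nn_fpm`, its argument names and the input values are assumptions made here, not taken from the text, and it requires N > n.

```r
# Sketch: posterior and predictive quantities for the basic normal-normal
# finite population model, following the summary formulas.
nn_fpm <- function(ys, N, sigma, mu0, sigma0, alpha = 0.05) {
  n <- length(ys); m <- N - n; ysbar <- mean(ys)
  k      <- n / (n + sigma^2 / sigma0^2)    # credibility factor
  mustar <- (1 - k) * mu0 + k * ysbar       # posterior mean of mu
  s2star <- k * sigma^2 / n                 # posterior variance of mu
  a  <- mustar                              # predictive mean of nonsample mean
  b2 <- s2star + sigma^2 / m                # predictive variance of nonsample mean
  c0 <- (n * ysbar + m * mustar) / N        # predictive mean of ybar
  d2 <- (m / N)^2 * b2                      # predictive variance of ybar
  z  <- qnorm(1 - alpha / 2)
  list(k = k, mustar = mustar, s2star = s2star, a = a, b2 = b2,
       c = c0, d2 = d2, cpdr = c0 + c(-1, 1) * z * sqrt(d2))
}

# Arbitrary illustrative inputs
res <- nn_fpm(ys = c(4.1, 5.3, 6.2, 4.8), N = 10, sigma = 1.5,
              mu0 = 5, sigma0 = 2)
res$cpdr
```

Returning all intermediate quantities in a list makes it easy to inspect the posterior for mu, the prediction for the nonsample mean and the prediction for the finite population mean from a single call.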
Exercise 10.1 Practice with the basic normal-normal finite population model


Consider the Bayesian model given by:
$(\bar{y}_r \mid y_s, \mu) \sim N(\mu, \sigma^2/m)$
$(y_1, \ldots, y_n \mid \mu) \sim \text{iid } N(\mu, \sigma^2)$
$\mu \sim N(\mu_0, \sigma_0^2)$.

(a) Express the predictive mean of the finite population mean y as a 
credibility estimate with a suitable credibility factor. Then also express 
the predictive variance and distribution in terms of that credibility factor. 
Use your results to answer parts (b) through (e) below. 


(b) What is the predictive distribution in the case of very weak prior 
information? 


(c) What is the predictive distribution in the case of very strong prior 
information? 


(d) What is the predictive distribution in the case of a very large sample 
size? 


(e) What is the predictive distribution in the case of a census? 


(f) Suppose we believe with a priori probability 95% that $\mu$ lies
between 7.0 and 13.0. We sample the values 5.7, 9.6 and 8.3 from a
finite population of seven units. Find the predictive mean and 95%
highest predictive density region for the average of all seven values in
the finite population if the superpopulation standard deviation is 2.0.


Create a graph showing: 
(i) the likelihood function for the superpopulation mean 
(ii) the prior density of the superpopulation mean 
(iii) the posterior density of the superpopulation mean 
(iv) the prior density of the nonsample mean 
(v) the predictive density of the nonsample mean 
(vi) the prior density of the finite population mean 
(vii) the predictive density of the finite population mean. 


In your graph indicate the predictive mean and 95% highest predictive
density region for the average of all seven values in the finite population.

Solution to Exercise 10.1

(a) It is easy to show that the predictive mean of $\bar{y}$, namely
\[ c = E(\bar{y} \mid y_s) = \frac{n\bar{y}_s + m\mu_*}{N} = \frac{n\bar{y}_s + (N-n)[(1-k)\mu_0 + k\bar{y}_s]}{N}, \]
may also be written as the credibility estimate
\[ c = (1-q)\mu_0 + q\bar{y}_s, \]
where
\[ q = \frac{n + (N-n)k}{N} \]
is the credibility factor, meaning the weight assigned to $\bar{y}_s$ (the direct
data estimate of $\bar{y}$), and where $1-q$ is the weight assigned to $\mu_0$ (the
prior estimate of $\bar{y}$).

It can then also be shown that the predictive variance of $\bar{y}$, namely
\[ d^2 = V(\bar{y} \mid y_s) = \frac{m^2}{N^2}\left(\sigma_*^2 + \frac{\sigma^2}{m}\right), \]
may be expressed as
\[ d^2 = \frac{(N-n)^2}{N^2}\left(\frac{k\sigma^2}{n} + \frac{\sigma^2}{N-n}\right) = \frac{\sigma^2}{n}\, q\left(1 - \frac{n}{N}\right). \]

Thus we may also write the predictive distribution of the finite
population mean as
\[ (\bar{y} \mid y_s) \sim N\left((1-q)\mu_0 + q\bar{y}_s,\; \frac{\sigma^2}{n}\, q\left(1 - \frac{n}{N}\right)\right), \]
where:
\[ q = \frac{n + (N-n)k}{N}, \qquad k = \frac{n}{n + \sigma^2/\sigma_0^2}. \]

Note: If the original credibility factor $k$ equals 1 then the second
credibility factor $q$ also equals 1. This then implies that we estimate $\bar{y}$ by
\[ c = (1-1)\mu_0 + 1\,\bar{y}_s = \bar{y}_s. \]
This makes sense because if the sample data values are given 'full
credibility' then their straight average should intuitively be used to
estimate the finite population mean.

On the other hand, if $k = 0$ then $q = n/N$ (the sampling fraction). This
then implies that we estimate $\bar{y}$ by
\[ c = (1 - n/N)\mu_0 + (n/N)\bar{y}_s = \{(N-n)\mu_0 + n\bar{y}_s\}/N. \]
This also makes sense because if the sample data are given 'zero
credibility' then each of the $N-n$ nonsampled values should
intuitively be estimated by the prior mean of the superpopulation mean, $\mu_0$.
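The credibility form in part (a) can be sketched in a few lines of R. This is only an illustration under the basic normal-normal model with known superpopulation standard deviation; the function name `pred_ybar` and its argument names are hypothetical, and the very small and very large values of `sig0` are stand-ins for the limits of very strong and very weak prior information.

```r
# Credibility form of the predictive distribution of the finite population
# mean under the basic normal-normal model (sigma known).
# pred_ybar() is a hypothetical helper, not part of any package.
pred_ybar <- function(ys, N, sig, mu0, sig0) {
  n <- length(ys)
  ysbar <- mean(ys)
  k <- n / (n + sig^2 / sig0^2)      # first credibility factor
  q <- (n + (N - n) * k) / N         # second credibility factor
  list(k = k, q = q,
       mean = (1 - q) * mu0 + q * ysbar,        # c = (1-q)*mu0 + q*ysbar
       var  = (sig^2 / n) * q * (1 - n / N))    # d^2
}

ys <- c(5.7, 9.6, 8.3)
strong <- pred_ybar(ys, N = 7, sig = 2, mu0 = 10, sig0 = 1e-6)  # k close to 0
weak   <- pred_ybar(ys, N = 7, sig = 2, mu0 = 10, sig0 = 1e6)   # k close to 1
c(strong$q, weak$q)        # close to n/N and to 1, respectively
c(strong$mean, weak$mean)
```

With `k` near 0 the predictive mean should be close to `((N-n)*mu0 + n*ysbar)/N`, and with `k` near 1 it should be close to the sample mean, in line with the two limiting cases discussed in the note.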


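(b) In the case of very weak prior information we have (in the limit) that
$\sigma_0 = \infty$, hence $k = 1$, and hence $q = 1$. Consequently
\[ (\bar{y} \mid y_s) \sim N\left((1-1)\mu_0 + 1\,\bar{y}_s,\; \frac{\sigma^2}{n}(1)\left(1 - \frac{n}{N}\right)\right) \sim N\left(\bar{y}_s,\; \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)\right). \]

This implies a posterior mean and $1-\alpha$ CPDR for $\bar{y}$ of $\bar{y}_s$ and
\[ \left(\bar{y}_s \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\sqrt{1 - \frac{n}{N}}\right). \]

Note: This is the same inference one would make via classical
techniques after substituting the sample standard deviation
\[ s_s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y}_s)^2} \]
for $\sigma$, assuming that $n$ is 'large'.

(c) In the case of very strong prior information we have (in the limit)
that $\sigma_0 = 0$, hence $k = 0$, and hence $q = n/N$. Consequently,
\[ (\bar{y} \mid y_s) \sim N\left(\left(1 - \frac{n}{N}\right)\mu_0 + \frac{n}{N}\bar{y}_s,\; \frac{\sigma^2}{n}\,\frac{n}{N}\left(1 - \frac{n}{N}\right)\right) \sim N\left(\frac{(N-n)\mu_0 + n\bar{y}_s}{N},\; \frac{\sigma^2}{N}\left(1 - \frac{n}{N}\right)\right). \]

(d) In the case of a very large sample size we have (in the limit) that
$n = \infty$, hence $k = 1$, and hence $q = 1$. Consequently (just as in (b) for
the case of very weak prior information),
\[ (\bar{y} \mid y_s) \sim N\left((1-1)\mu_0 + 1\,\bar{y}_s,\; \frac{\sigma^2}{n}(1)\left(1 - \frac{n}{N}\right)\right) \sim N\left(\bar{y}_s,\; \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)\right). \]

(e) In the case of a census we have $n = N$, hence
\[ q = \frac{N + (N-N)k}{N} = 1, \]
and therefore
\[ (\bar{y} \mid y_s) \sim N\left((1-1)\mu_0 + 1\,\bar{y}_s,\; \frac{\sigma^2}{n}(1)\left(1 - \frac{N}{N}\right)\right) \sim N(\bar{y}_s, 0), \]
meaning that $\bar{y} = \bar{y}_s$ with posterior probability 1 (obviously).

Note: Some of the equations developed previously implicitly assume
that $n < N$.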


(f) The given specifications imply that:

$n = 3$, $N = 7$, $m = N - n = 4$, $\sigma = 2$

$\bar{y}_s = \dfrac{5.7 + 9.6 + 8.3}{3} = 7.8667$

$\mu_0 = 10$, $\sigma_0 = \dfrac{3}{1.96} = 1.53064$

$\mu \sim N(\mu_0, \sigma_0^2)$

$k = \dfrac{n}{n + \sigma^2/\sigma_0^2} = 0.63731$

$\mu_* = (1-k)\mu_0 + k\bar{y}_s = 8.6404$, $\sigma_* = \sqrt{k\sigma^2/n} = 0.9218141$

$(\mu \mid y_s) \sim N(\mu_*, \sigma_*^2)$

$a = \mu_* = 8.6404$, $b = \sqrt{\sigma_*^2 + \sigma^2/m} = 1.3601$

$(\bar{y}_r \mid y_s) \sim N(a, b^2)$

$q = \dfrac{n + (N-n)k}{N} = 0.79275$

$c = \dfrac{n\bar{y}_s + m\mu_*}{N} = (1-q)\mu_0 + q\bar{y}_s = 8.3088$

$d = \dfrac{m}{N}\, b = 0.77717$

$(\bar{y} \mid y_s) \sim N(c, d^2)$.

So the predictive mean of $\bar{y}$, the average of all 7 values in the finite
population, is $c = 8.3088$, and the 95% highest predictive density region
for that average is $(c \pm 1.96d) = (6.7856, 9.8320)$. Figure 10.1 shows:


(i) the likelihood function for the superpopulation mean, $L(\mu)$, equal to
the posterior density of $\mu$ under a flat prior; thus $L(\mu) = f_{N(\bar{y}_s,\, \sigma^2/n)}(\mu)$

(ii) the prior density of the superpopulation mean,
$f(\mu) = f_{N(\mu_0,\, \sigma_0^2)}(\mu)$

(iii) the posterior density of the superpopulation mean,
$f(\mu \mid y_s) = f_{N(\mu_*,\, \sigma_*^2)}(\mu)$

(iv) the prior density of the nonsample mean,
$f(\bar{y}_r) = f_{N(\mu_0,\, \sigma_0^2 + \sigma^2/m)}(\bar{y}_r)$

(v) the predictive density of the nonsample mean,
$f(\bar{y}_r \mid y_s) = f_{N(a,\, b^2)}(\bar{y}_r)$

(vi) the prior density of the finite population mean,
$f(\bar{y}) = f_{N(\mu_0,\, \sigma_0^2 + \sigma^2/N)}(\bar{y})$

(vii) the predictive density of the finite population mean,
$f(\bar{y} \mid y_s) = f_{N(c,\, d^2)}(\bar{y})$.

Figure 10.1 Various densities in Exercise 10.1

[Figure: seven curves, labelled (i) Likelihood, (ii) Prior, (iii) Posterior, (iv) Prior pdf of yrbar, (v) Predictive pdf for yrbar, (vi) Prior pdf of ybar, (vii) Predictive pdf for ybar. Horizontal axis: mu, yrbar, ybar. Vertical axis: density, likelihood. The thin vertical lines show the predictive mean and 95% HPDR bounds for ybar.]


In Figure 10.1, we may observe how the prior densities of $\mu$, $\bar{y}_r$ and $\bar{y}$
are all centred around the prior mean $\mu_0 = 10$. The line for $\mu$ is most
highly concentrated about 10 because it represents the prior density of
the mean of a hypothetically infinite number of population values. The
line for $\bar{y}_r$ is the least focused about 10 because it represents the prior
density of the mean of only 4 such values (compared with the line for $\bar{y}$
which is the prior pdf for the mean of 7 such values).

Each of the posterior/predictive densities for $\mu$, $\bar{y}_r$ and $\bar{y}$ is located
somewhere between the corresponding prior density and the likelihood
function. The posterior/predictive densities for $\mu$ and $\bar{y}_r$ are centred at
the same value, namely the posterior mean, $\mu_* = 8.6404$, whereas the
predictive density for $\bar{y}$ is centred closer to the likelihood mode,
$\bar{y}_s = 7.8667$. This is because the second credibility factor is larger than
the first ($q = 0.79275 > k = 0.63731$).

R Code for Exercise 10.1

ys=c(5.7,9.6,8.3); ysbar=mean(ys); ysbar # 7.866667
sig=2; n=3; N=7; m=N-n; mu0=10; sig0=3/qnorm(0.975)
k=n/(n+sig^2/sig0^2); q=(n+m*k)/N
c(m,mu0,sig0,k,q) # 4.0000000 10.0000000 1.5306404 0.6373060 0.7927463
mustar=(1-k)*mu0+k*ysbar; sigstar2=k*sig^2/n
c(mustar,sqrt(sigstar2)) # 8.6404139 0.9218141
a=mustar; b2=sigstar2+sig^2/m; c=(n*ysbar+m*a)/N; d2=(m/N)^2*b2
c(a,sqrt(b2),c,sqrt(d2)) # 8.6404139 1.3600519 8.3088080 0.7771725
HPDR=c+c(-1,1)*qnorm(0.975)*sqrt(d2); HPDR # 6.785578 9.832038

X11(w=8,h=7); par(mfrow=c(1,1))
plot(c(4,15),c(0,0.6),type="n",xlab="mu, yrbar, ybar",
     ylab="density, likelihood", main="")
v=seq(0,20,0.01)
lines(v,dnorm(v,ysbar,sig/sqrt(n)),lty=1,lwd=3,col="black")
# likelihood function (i)
lines(v,dnorm(v,mu0,sig0),lty=2,lwd=2,col="red") # prior (ii)
lines(v,dnorm(v,mustar,sqrt(sigstar2)),lty=2,lwd=3,col="red") # posterior (iii)
# dnorm() takes a standard deviation, hence the sqrt() around each variance
lines(v,dnorm(v,mu0,sqrt(sig0^2+sig^2/m)),lty=3,lwd=2,col="blue")
# prior pdf of yrbar (iv)
lines(v,dnorm(v,a,sqrt(b2)),lty=3,lwd=3,col="blue")
# predictive pdf of yrbar (v)
lines(v,dnorm(v,mu0,sqrt(sig0^2+sig^2/N)),lty=4,lwd=2,col="green")
# prior pdf of ybar (vi)
lines(v,dnorm(v,c,sqrt(d2)),lty=4,lwd=3,col="green")
# predictive pdf of ybar (vii)
abline(v=c(c,HPDR),lty=1,lwd=1)
legend(3.8,0.6,c("(i) Likelihood","(ii) Prior","(iii) Posterior"),
       lty=c(1,2,2),lwd=c(3,2,3),col=c("black","red","red"))
legend(10,0.6,c("(iv) Prior pdf of yrbar","(v) Predictive pdf for yrbar",
       "(vi) Prior pdf of ybar","(vii) Predictive pdf for ybar"),
       lty=c(3,3,4,4),lwd=c(2,3,2,3),col=c("blue","blue","green","green"))
text(12.5,0.38,"The thin vertical lines show the predictive")
text(12.5,0.345,"mean and 95% HPDR bounds for ybar")

10.2 The general normal-normal finite population model


The basic normal-normal finite population model examined in the 
previous section assumes that: 
* all N values in the finite population are conditionally normal and iid 


* we are interested only in the nonsample mean $\bar{y}_r$ and functions of
$\bar{y}_r$ (such as the finite population mean $\bar{y}$).


We will now examine a generalisation of this basic model which allows 
for: 
* non-independence of values 
* covariate information 
* inference on the entire nonsample vector and linear combinations 
thereof. 


We will continue to assume that the values in the population are all 
(conditionally) normally distributed, and that the (conditional) variance 
of each value in the finite population is known. We will now also 
assume that all the covariance terms between these values are known. 
(These assumptions will be relaxed at a later stage.) 


First, define the (finite) population vector in column form as
\[ y = \begin{pmatrix} y_s \\ y_r \end{pmatrix} = (y_1, \ldots, y_n, y_{n+1}, \ldots, y_N)', \]
where $y_s = (y_1, \ldots, y_n)'$ and $y_r = (y_{n+1}, \ldots, y_N)'$.

Next, suppose that auxiliary information is available in the form of an $N$
by $p$ matrix
\[ X = [x_1, \ldots, x_p] = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{N1} & \cdots & x_{Np} \end{pmatrix}, \]
where
\[ x_j = (x_{1j}, \ldots, x_{Nj})' \]
is the population vector for the $j$th explanatory variable ($j = 1, \ldots, p$).

Also suppose that the finite population vector $y$ has a known variance-covariance
structure in the form of an $N$ by $N$ positive definite matrix
\[ \Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1N} \\ \vdots & \ddots & \vdots \\ \sigma_{N1} & \cdots & \sigma_{NN} \end{pmatrix}, \]
where: $\sigma_{ij} = C(y_i, y_j) = \sigma_{ji}$
       $\sigma_{ii} = V y_i = \sigma_i^2$,
with the covariance and variance operations here ($C$ and $V$) implicitly
conditional on all model parameters.

In the above context, the Bayesian model we will focus on is:
\[ (y \mid \beta) \sim N_N(X\beta, \Sigma) \]
\[ \beta \sim N_p(\delta, \Omega). \]

This model will be called the general normal-normal finite population
model. Here,
\[ \beta = (\beta_1, \ldots, \beta_p)' \]
is the vector of regression coefficients, whose prior distribution is
multivariate normal with (specified) mean
\[ \delta = (\delta_1, \ldots, \delta_p)' \]
and (specified) variance-covariance matrix
\[ \Omega = \begin{pmatrix} \omega_{11} & \cdots & \omega_{1p} \\ \vdots & \ddots & \vdots \\ \omega_{p1} & \cdots & \omega_{pp} \end{pmatrix}, \]
where: $\omega_{ij} = C(\beta_i, \beta_j) = \omega_{ji}$
       $\omega_{ii} = V\beta_i = \omega_i^2$,
with the covariance and variance operations here ($C$ and $V$) implicitly
unconditional, thereby reflecting prior belief regarding the $\beta_i$ values.

We will assume interest lies generally in the nonsample vector $y_r$ and
functions of that vector, and specifically in the finite population mean $\bar{y}$
(a simple function of $y_r$ and of the known quantities $y_s$, $n$ and $N$). Thus
the regression coefficient vector $\beta$ will be treated as a nuisance
parameter and inference will be based on the predictive distribution of
$y_r$ given $y_s$.


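Note: The basic normal finite population model as considered
previously is a special case of the just-defined general normal finite
population model with:

$p = 1$, $\beta = (\beta_1) = \mu$, $\delta = (\delta_1) = \mu_0$, $\Omega = (\omega_{11}) = \sigma_0^2$,

$X = 1_N = (1, \ldots, 1)'$ (a column vector of $N$ ones),

$\Sigma = \sigma^2 I_N = \mathrm{diag}(\sigma^2, \ldots, \sigma^2)$ (where $I_N$ is the $N$ by $N$ identity matrix).

Thus, the previous normal finite population model could also be
written as:
\[ (y \mid \mu) \sim N_N(1_N\mu, \sigma^2 I_N) \]
\[ \mu \sim N_1(\mu_0, \sigma_0^2). \]

10.3 Derivation of the predictive distribution of the nonsample vector

Observe that the unconditional (or prior) distribution of the entire finite
population vector $y$ is given by the density
\[ f(y) = \int f(y, \beta)\, d\beta = \int f(y \mid \beta) f(\beta)\, d\beta. \]

Now, the integrand of this multiple integral is the exponential of a
quadratic in the $y_i$ and $\beta_j$ values. This implies that the value of the
integral is the exponential of a quadratic in the $y_i$ values alone. This
then implies that the prior (or unconditional) distribution of $y$ is also
multivariate normal. It then remains to find the mean vector and covariance
matrix of that prior distribution, as follows:
\[ Ey = EE(y \mid \beta) = E(X\beta) = X\delta \]
\[ Vy = EV(y \mid \beta) + VE(y \mid \beta) = E\Sigma + V(X\beta) = \Sigma + X\Omega X'. \]

Thus, $y \sim N_N(X\delta, \Sigma + X\Omega X')$.

This result may also be written as
\[ \begin{pmatrix} y_s \\ y_r \end{pmatrix} \sim N_N\left( \begin{pmatrix} X_s\delta \\ X_r\delta \end{pmatrix}, \begin{pmatrix} \Sigma_{ss} + X_s\Omega X_s' & \Sigma_{sr} + X_s\Omega X_r' \\ \Sigma_{rs} + X_r\Omega X_s' & \Sigma_{rr} + X_r\Omega X_r' \end{pmatrix} \right), \]
where we partition $X$ and $\Sigma$ according to
\[ X = \begin{pmatrix} X_s \\ X_r \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{ss} & \Sigma_{sr} \\ \Sigma_{rs} & \Sigma_{rr} \end{pmatrix}. \]
Thus, $X_s$ is a submatrix consisting of the first $n$ rows of $X$, etc.

It follows by standard multivariate normal theory (see below) that
\[ (y_r \mid y_s) \sim N_m(E_*, V_*), \]
where:
\[ E_* = X_r\delta + (\Sigma_{rs} + X_r\Omega X_s')(\Sigma_{ss} + X_s\Omega X_s')^{-1}(y_s - X_s\delta) \qquad (10.1) \]
\[ V_* = (\Sigma_{rr} + X_r\Omega X_r') - (\Sigma_{rs} + X_r\Omega X_s')(\Sigma_{ss} + X_s\Omega X_s')^{-1}(\Sigma_{sr} + X_s\Omega X_r'). \qquad (10.2) \]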
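As a rough illustration of equations (10.1) and (10.2), the following R sketch computes the predictive mean vector and covariance matrix of the nonsample vector directly from the partitioned prior covariance of y. The function name `pred_nonsample` and its arguments are hypothetical. The example treats the basic model of Exercise 10.1 as a special case (X a column of ones, Sigma equal to sigma^2 times the identity), so the summaries `a` and `b` should agree with the values 8.6404 and 1.3601 reported in that exercise.

```r
# Predictive mean vector and covariance matrix of y_r given y_s,
# computed directly from equations (10.1) and (10.2).
# pred_nonsample() is a hypothetical helper written for this sketch.
pred_nonsample <- function(ys, Xs, Xr, Sig, delta, Omega) {
  n <- length(ys); N <- nrow(Sig)
  s <- 1:n; r <- (n + 1):N
  X   <- rbind(Xs, Xr)
  Vy  <- Sig + X %*% Omega %*% t(X)       # prior covariance of y
  Vss <- Vy[s, s, drop = FALSE]
  Vrs <- Vy[r, s, drop = FALSE]
  Vrr <- Vy[r, r, drop = FALSE]
  E <- Xr %*% delta + Vrs %*% solve(Vss, ys - Xs %*% delta)   # (10.1)
  V <- Vrr - Vrs %*% solve(Vss, t(Vrs))                       # (10.2)
  list(E = E, V = V)
}

# Basic model as a special case, using the data of Exercise 10.1(f)
ys <- c(5.7, 9.6, 8.3); n <- 3; N <- 7; m <- N - n
out <- pred_nonsample(ys, Xs = matrix(1, n, 1), Xr = matrix(1, m, 1),
                      Sig = 2^2 * diag(N), delta = 10,
                      Omega = matrix((3 / qnorm(0.975))^2, 1, 1))
a <- mean(out$E)                 # predictive mean of the nonsample mean
b <- sqrt(sum(out$V) / m^2)      # predictive sd of the nonsample mean
c(a, b)
```

Using `solve(Vss, ...)` rather than forming the inverse explicitly is a common numerical choice; it does not change the formulas.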
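Note: In deriving (10.1) and (10.2) we have used the following result
(e.g. see equation (8a.2.11) in Rao, 1973):
\[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right) \]
\[ \Rightarrow\; (X_1 \mid X_2) \sim N\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(X_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right). \]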


10.4 Alternative formulae for the predictive 
distribution of the nonsample vector 


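Another way to obtain the distribution of $(y_r \mid y_s)$ (already derived
above) is as follows. First, the posterior density of $\beta$ is
\[ f(\beta \mid y_s) \propto f(\beta) f(y_s \mid \beta) \]
\[ \propto \exp\left\{-\tfrac{1}{2}(\beta - \delta)'\Omega^{-1}(\beta - \delta)\right\} \exp\left\{-\tfrac{1}{2}(y_s - X_s\beta)'\Sigma_{ss}^{-1}(y_s - X_s\beta)\right\} = \exp\left(-\tfrac{1}{2}Q_1\right), \]
where
\[ Q_1 = (\beta - \delta)'\Omega^{-1}(\beta - \delta) + (y_s - X_s\beta)'\Sigma_{ss}^{-1}(y_s - X_s\beta). \]

We see that $f(\beta \mid y_s)$ is proportional to the exponential of a quadratic
form in $\beta$. This implies that
\[ (\beta \mid y_s) \sim N_p(\hat\beta, D) \]
for some $\hat\beta$ and $D$ to be determined.

Now observe that
\[ f(\beta \mid y_s) \propto \exp\left(-\tfrac{1}{2}Q_2\right), \]
where
\[ Q_2 = (\beta - \hat\beta)'D^{-1}(\beta - \hat\beta) = \beta'D^{-1}\beta - \beta'D^{-1}\hat\beta - \hat\beta'D^{-1}\beta + \text{constant} \qquad (10.3) \]
(where the constant does not depend on $\beta$).

But
\[ Q_1 = \beta'\Omega^{-1}\beta - \beta'\Omega^{-1}\delta - \delta'\Omega^{-1}\beta - y_s'\Sigma_{ss}^{-1}X_s\beta - \beta'X_s'\Sigma_{ss}^{-1}y_s + \beta'X_s'\Sigma_{ss}^{-1}X_s\beta + \text{constant} \]
\[ = \beta'(\Omega^{-1} + X_s'\Sigma_{ss}^{-1}X_s)\beta - \beta'(\Omega^{-1}\delta + X_s'\Sigma_{ss}^{-1}y_s) - (\delta'\Omega^{-1} + y_s'\Sigma_{ss}^{-1}X_s)\beta + \text{constant}. \qquad (10.4) \]

Equating (10.3) and (10.4) we see that:
\[ D^{-1} = \Omega^{-1} + X_s'\Sigma_{ss}^{-1}X_s \]
\[ D^{-1}\hat\beta = \Omega^{-1}\delta + X_s'\Sigma_{ss}^{-1}y_s. \]
It follows that:
\[ D = (\Omega^{-1} + X_s'\Sigma_{ss}^{-1}X_s)^{-1} \]
\[ \hat\beta = D(\Omega^{-1}\delta + X_s'\Sigma_{ss}^{-1}y_s). \]

We can now use the result
\[ (\beta \mid y_s) \sim N_p(\hat\beta, D) \]
to find the predictive mean and variance of $y_r$.

First, observe that
\[ (y \mid \beta) \sim N_N(X\beta, \Sigma) \]
may also be written
\[ \left( \begin{pmatrix} y_s \\ y_r \end{pmatrix} \,\Big|\, \beta \right) \sim N_N\left( \begin{pmatrix} X_s\beta \\ X_r\beta \end{pmatrix}, \begin{pmatrix} \Sigma_{ss} & \Sigma_{sr} \\ \Sigma_{rs} & \Sigma_{rr} \end{pmatrix} \right), \]
which implies that
\[ (y_r \mid y_s, \beta) \sim N_m\left( X_r\beta + \Sigma_{rs}\Sigma_{ss}^{-1}(y_s - X_s\beta),\; \Sigma_{rr} - \Sigma_{rs}\Sigma_{ss}^{-1}\Sigma_{sr} \right). \]

It follows that:
\[ E(y_r \mid y_s) = E\{E(y_r \mid y_s, \beta) \mid y_s\} = E\{X_r\beta + \Sigma_{rs}\Sigma_{ss}^{-1}(y_s - X_s\beta) \mid y_s\} \]
\[ = X_r\hat\beta + \Sigma_{rs}\Sigma_{ss}^{-1}(y_s - X_s\hat\beta) \qquad (10.5) \]
\[ V(y_r \mid y_s) = E\{V(y_r \mid y_s, \beta) \mid y_s\} + V\{E(y_r \mid y_s, \beta) \mid y_s\} \]
\[ = E(\Sigma_{rr} - \Sigma_{rs}\Sigma_{ss}^{-1}\Sigma_{sr} \mid y_s) + V\{X_r\beta + \Sigma_{rs}\Sigma_{ss}^{-1}(y_s - X_s\beta) \mid y_s\} \]
\[ = \Sigma_{rr} - \Sigma_{rs}\Sigma_{ss}^{-1}\Sigma_{sr} + V\{(X_r - \Sigma_{rs}\Sigma_{ss}^{-1}X_s)\beta \mid y_s\} \]
\[ = \Sigma_{rr} - \Sigma_{rs}\Sigma_{ss}^{-1}\Sigma_{sr} + (X_r - \Sigma_{rs}\Sigma_{ss}^{-1}X_s)D(X_r - \Sigma_{rs}\Sigma_{ss}^{-1}X_s)'. \qquad (10.6) \]

Note: The expression for $E_*$ at (10.1) must be the same as that for
$E(y_r \mid y_s)$ at (10.5), and likewise the expression for $V_*$ at (10.2) must
be the same as that for $V(y_r \mid y_s)$ at (10.6). This equivalence can also
be shown with some algebra by making use of the formula
\[ (\Sigma_{ss} + X_s\Omega X_s')^{-1} = \Sigma_{ss}^{-1}\left\{ I_n - X_s(\Omega^{-1} + X_s'\Sigma_{ss}^{-1}X_s)^{-1}X_s'\Sigma_{ss}^{-1} \right\}, \]
which in turn follows from the general matrix identity
\[ (A + UW^{-1}V)^{-1} = A^{-1} - A^{-1}U(W + VA^{-1}U)^{-1}VA^{-1}. \]

Here, $I_n$ is the $n$ by $n$ identity matrix and could also be written simply as $I$.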
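The equivalence of (10.1)-(10.2) with (10.5)-(10.6) can also be checked numerically. The sketch below does this in R for one randomly generated example; all inputs (dimensions, covariates, the positive definite matrix `Sig`, and the prior quantities `delta` and `Om`) are arbitrary values made up for the check and have no other significance.

```r
set.seed(1)
n <- 4; m <- 3; N <- n + m                      # arbitrary sizes, p = 2
X  <- cbind(1, rnorm(N))
Xs <- X[1:n, , drop = FALSE]; Xr <- X[-(1:n), , drop = FALSE]
L  <- matrix(rnorm(N * N), N, N)
Sig <- crossprod(L) + diag(N)                   # a positive definite Sigma
Om <- diag(c(2, 0.5)); delta <- c(1, -1); ys <- rnorm(n)
Sss <- Sig[1:n, 1:n]; Srs <- Sig[-(1:n), 1:n]; Srr <- Sig[-(1:n), -(1:n)]

# Route 1: equations (10.1) and (10.2)
Vss <- Sss + Xs %*% Om %*% t(Xs)
Vrs <- Srs + Xr %*% Om %*% t(Xs)
E1 <- Xr %*% delta + Vrs %*% solve(Vss, ys - Xs %*% delta)
V1 <- Srr + Xr %*% Om %*% t(Xr) - Vrs %*% solve(Vss, t(Vrs))

# Route 2: equations (10.5) and (10.6), via the posterior of beta
D    <- solve(solve(Om) + t(Xs) %*% solve(Sss, Xs))
bhat <- D %*% (solve(Om, delta) + t(Xs) %*% solve(Sss, ys))
Del  <- Xr - Srs %*% solve(Sss, Xs)
E2 <- Xr %*% bhat + Srs %*% solve(Sss, ys - Xs %*% bhat)
V2 <- Srr - Srs %*% solve(Sss, t(Srs)) + Del %*% D %*% t(Del)

all.equal(E1, E2)   # the two routes should agree up to rounding error
all.equal(V1, V2)
```

A single random example does not prove the identity, but a disagreement here would point to an algebra or coding slip.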


10.5 Prediction of the finite population mean 
and other linear combinations 


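We may now write down a general expression for the predictive
distribution of the finite population mean. That mean may be expressed
as the linear combination
\[ \bar{y} = \frac{y_{sT} + y_{rT}}{N} = \frac{1}{N}(y_{sT} + 1_m'y_r). \]

Note: Here, $1_m'$ denotes the row vector with $m = N - n$ ones. This
vector could also be written $1_m^T$ or $1_{N-n}'$, or $(1, \ldots, 1)$.

Therefore the predictive distribution of $\bar{y}$ given $y_s$ is normal with mean
\[ e_* = \frac{y_{sT} + 1_m'E_*}{N} \quad \text{and variance} \quad v_* = \frac{1_m'V_*1_m}{N^2}. \]

So the $1-\alpha$ CPDR for $\bar{y}$ is $(e_* \pm z_{\alpha/2}\sqrt{v_*})$.

More generally, the predictive distribution of the linear combination
\[ \psi = c_0 + c_s'y_s + c_r'y_r = c_0 + c_1y_1 + \cdots + c_Ny_N \]
is normal with mean $e_\psi = c_0 + c_s'y_s + c_r'E_*$ and variance $v_\psi = c_r'V_*c_r$,
where $c_s = (c_1, \ldots, c_n)'$ and $c_r = (c_{n+1}, \ldots, c_N)'$.

So the $1-\alpha$ CPDR for $\psi$ is $(e_\psi \pm z_{\alpha/2}\sqrt{v_\psi})$.

Summary: For the general normal-normal finite population model:
\[ (y \mid \beta) \sim N_N(X\beta, \Sigma) \]
\[ \beta \sim N_p(\delta, \Omega), \]
the posterior distribution of the regression vector $\beta$ is given by
\[ (\beta \mid y_s) \sim N_p(\hat\beta, D), \]
where: $\hat\beta = D(\Omega^{-1}\delta + X_s'\Sigma_{ss}^{-1}y_s)$, $D = (\Omega^{-1} + X_s'\Sigma_{ss}^{-1}X_s)^{-1}$.

The predictive distribution of the nonsample vector $y_r$ is given by
\[ (y_r \mid y_s) \sim N_m(E_*, V_*) \quad (m = N - n), \]
where:
\[ E_* = X_r\delta + (\Sigma_{rs} + X_r\Omega X_s')(\Sigma_{ss} + X_s\Omega X_s')^{-1}(y_s - X_s\delta) \]
\[ \phantom{E_*} = X_r\hat\beta + \Sigma_{rs}\Sigma_{ss}^{-1}(y_s - X_s\hat\beta) \]
\[ V_* = (\Sigma_{rr} + X_r\Omega X_r') - (\Sigma_{rs} + X_r\Omega X_s')(\Sigma_{ss} + X_s\Omega X_s')^{-1}(\Sigma_{sr} + X_s\Omega X_r') \]
\[ \phantom{V_*} = \Sigma_{rr} - \Sigma_{rs}\Sigma_{ss}^{-1}\Sigma_{sr} + (X_r - \Sigma_{rs}\Sigma_{ss}^{-1}X_s)D(X_r - \Sigma_{rs}\Sigma_{ss}^{-1}X_s)'. \]

The predictive distribution of the finite population mean $\bar{y}$ is given by
\[ (\bar{y} \mid y_s) \sim N(e_*, v_*), \quad \text{where } e_* = \frac{y_{sT} + 1_m'E_*}{N} \text{ and } v_* = \frac{1_m'V_*1_m}{N^2}, \]
with $1-\alpha$ CPDR for $\bar{y}$ given by $(e_* \pm z_{\alpha/2}\sqrt{v_*})$.

The predictive distribution of any linear combination of the form
$\psi = c_0 + c_s'y_s + c_r'y_r$ is given by
\[ (\psi \mid y_s) \sim N(e_\psi, v_\psi), \]
where $e_\psi = c_0 + c_s'y_s + c_r'E_*$ and $v_\psi = c_r'V_*c_r$,
with $1-\alpha$ CPDR for $\psi$ given by $(e_\psi \pm z_{\alpha/2}\sqrt{v_\psi})$.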


10.6 Special cases including ratio estimation 


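In the context of the above general normal-normal finite population
model, suppose that $p = 1$ (i.e. there is a single covariate) and the
population values are conditionally independent, the $i$th one having
mean $x_i\beta$ and variance $x_i^{2\gamma}\sigma^2$, where $\gamma \in \mathbb{R}$ and $\sigma^2 > 0$ are known.
Also, suppose that the prior distribution on the single regression
coefficient $\beta$ is normal with mean $\delta$ and variance $\omega^2$. Then:
\[ (y \mid \beta) \sim N_N(X\beta, \Sigma) \]
\[ \beta \sim N_p(\delta, \Omega), \]
where: $p = 1$, $X = x = (x_1, \ldots, x_N)'$, $\Sigma = \sigma^2\,\mathrm{diag}(x_1^{2\gamma}, \ldots, x_N^{2\gamma})$, $\Omega = \omega^2$.

The model may also be written in non-matrix form as:
\[ (y_i \mid \beta) \sim \perp N(x_i\beta, x_i^{2\gamma}\sigma^2), \quad i = 1, \ldots, N \]
\[ \beta \sim N(\delta, \omega^2). \]

Under this model it can be shown that the predictive distribution of the
finite population mean is given by
\[ (\bar{y} \mid y_s) \sim N(A, B^2), \]
where:
\[ A = \frac{n}{N}\bar{y}_s + \left(1 - \frac{n}{N}\right)\bar{x}_r\, \frac{\sigma^2\delta + \omega^2\sum_{i=1}^{n} x_i^{1-2\gamma}y_i}{\sigma^2 + \omega^2\sum_{i=1}^{n} x_i^{2-2\gamma}} \]
\[ B^2 = \frac{\sigma^2}{N^2}\left\{ \sum_{i=n+1}^{N} x_i^{2\gamma} + \frac{\omega^2 m^2\bar{x}_r^2}{\sigma^2 + \omega^2\sum_{i=1}^{n} x_i^{2-2\gamma}} \right\} \]
\[ \bar{x}_r = \frac{1}{m}\sum_{i=n+1}^{N} x_i \quad \text{(average of the covariate values in the nonsample).} \]

Now suppose it is believed that the variances of the population values
are exactly proportional to the covariate values, i.e. $V(y_i \mid \beta) = x_i\sigma^2$.

Then $\gamma = 1/2$, and we find that:
\[ A = \frac{n}{N}\bar{y}_s + \left(1 - \frac{n}{N}\right)\bar{x}_r\, \frac{\omega^2\bar{y}_s + \delta\sigma^2/n}{\omega^2\bar{x}_s + \sigma^2/n} \]
\[ B^2 = \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)\bar{x}_r\left\{ \frac{n}{N} + \left(1 - \frac{n}{N}\right)\frac{\omega^2\bar{x}_r}{\omega^2\bar{x}_s + \sigma^2/n} \right\} \]
\[ \bar{x}_s = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{(the average of the covariate values in the sample).} \]

If there is a priori ignorance regarding $\beta$ we may further set $\omega = \infty$, and
in that case:
\[ A = \frac{n}{N}\bar{y}_s + \left(1 - \frac{n}{N}\right)\bar{x}_r\frac{\bar{y}_s}{\bar{x}_s} = \frac{\bar{y}_s}{\bar{x}_s}\left\{ \frac{n\bar{x}_s + (N-n)\bar{x}_r}{N} \right\} = \frac{\bar{y}_s}{\bar{x}_s}\,\bar{x} \]
\[ B^2 = \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)\bar{x}_r\left\{ \frac{n}{N} + \left(1 - \frac{n}{N}\right)\frac{\bar{x}_r}{\bar{x}_s} \right\} = \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)\frac{\bar{x}_r\,\bar{x}}{\bar{x}_s} \]
\[ \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \quad \text{(average of covariate values in the finite population).} \]

As regards this last special case, we see that the predictive mean $A$ is
identical to the common design-based ratio estimator.

Also, the predictive variance $B^2$, although not identical to any design-based
formula, is the same as a model-based analogue (e.g. see Brewer,
1963, and Royall, 1970). The formula for $B^2$ suggests a purposive
sampling scheme whereby units with the largest covariate values should
be selected.

Note 1: If units with relatively large y-values are selected, then $\bar{x}_s$ will
likely be larger than $\bar{x}_r$, so that then $\bar{x}_r/\bar{x}_s$ will likely be small, and
thereby
\[ B^2 = V(\bar{y} \mid y_s) = \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)\frac{\bar{x}_r\,\bar{x}}{\bar{x}_s} \]
will also likely be small.

Note 2: The same formulae as derived in the last special case will also
apply approximately when the sample size $n$ is very large. This makes
sense because the effect of a very large sample size is the same as that
of a very diffuse prior. Note that in the case of a census, $n = N$ and we
find that the above formulae correctly yield $A = \bar{y}_s$ and $B^2 = 0$.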


In a way similar to the above, it is possible to obtain analogues of other 
common design-based and model-based results, such as regression and 
stratified estimators, together with their associated variances (see 
Ericson, 1969 and 1988). 
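To make the special cases of this section concrete, here is a small R sketch of the predictive mean and variance of the finite population mean under the single-covariate model with variance proportional to x^(2*gamma). The function name `ratio_pred`, its arguments, and the data values are all hypothetical, chosen only to illustrate the calculation; the computation follows the posterior of beta and then the predictive moments, rather than any particular closed form.

```r
# Predictive mean (A) and variance (B2) of the finite population mean under
# y_i | beta ~ N(x_i*beta, x_i^(2*gam)*sig^2), beta ~ N(delta, omega^2).
# omega = Inf represents a priori ignorance about beta.
ratio_pred <- function(ys, xs, xr, sig, gam = 0.5, delta = 0, omega = Inf) {
  n <- length(ys); m <- length(xr); N <- n + m
  prec_prior <- if (is.finite(omega)) 1 / omega^2 else 0
  D    <- 1 / (prec_prior + sum(xs^(2 - 2 * gam)) / sig^2)   # posterior variance of beta
  bhat <- D * (prec_prior * delta + sum(xs^(1 - 2 * gam) * ys) / sig^2)
  A  <- (sum(ys) + sum(xr) * bhat) / N
  B2 <- (sig^2 * sum(xr^(2 * gam)) + sum(xr)^2 * D) / N^2
  c(A = A, B2 = B2)
}

# Hypothetical data: three sampled units and four nonsampled units
xs <- c(2, 5, 3); ys <- c(4.1, 9.8, 6.4); xr <- c(1, 4, 6, 2); sig <- 1
out  <- ratio_pred(ys, xs, xr, sig)          # gam = 1/2, omega = Inf
xbar <- mean(c(xs, xr)); n <- 3; N <- 7
out
c(mean(ys) / mean(xs) * xbar,                               # ratio estimator
  sig^2 / n * (1 - n / N) * mean(xr) * xbar / mean(xs))     # its model-based variance
```

With `gam = 0.5` and `omega = Inf` the two printed vectors should coincide, reflecting the statement that the predictive mean reduces to the ratio estimator.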
Exercise 10.2 Derivation of the Bayesian ratio estimator

Consider the Bayesian model given by:
\[ (y_i \mid \beta) \sim \perp N(x_i\beta, x_i\sigma^2), \quad i = 1, \ldots, N \]
\[ f(\beta) \propto 1, \quad \beta \in \mathbb{R}. \]

Derive the predictive distribution of the finite population mean given
data of the form $D = (s, y_s)$.


Solution to Exercise 10.2 


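The Bayesian model is:
\[ (y \mid \beta) \sim N_N(X\beta, \Sigma) \]
\[ \beta \sim N_p(\delta, \Omega), \]
where:

$p = 1$, $\delta = 0$, $\Omega = \infty$, $X = x = (x_1, \ldots, x_N)'$,

$\Sigma = \sigma^2\,\mathrm{diag}(x_1, \ldots, x_N)$.

Note: Here, $\mathrm{diag}(x_1, \ldots, x_N)$ is a matrix with zeros everywhere except for
$x_1, \ldots, x_N$ along the main diagonal.

Using general results derived previously we first have that
\[ (\beta \mid y_s) \sim N_1(\hat\beta, D), \]
where:
\[ D = (\Omega^{-1} + X_s'\Sigma_{ss}^{-1}X_s)^{-1} = \left\{ 0 + (x_1, \ldots, x_n)\,\frac{1}{\sigma^2}\mathrm{diag}\left(\frac{1}{x_1}, \ldots, \frac{1}{x_n}\right)(x_1, \ldots, x_n)' \right\}^{-1} = \frac{\sigma^2}{x_{sT}} \]
\[ \hat\beta = D(\Omega^{-1}\delta + X_s'\Sigma_{ss}^{-1}y_s) = D\left\{ 0 + (x_1, \ldots, x_n)\,\frac{1}{\sigma^2}\mathrm{diag}\left(\frac{1}{x_1}, \ldots, \frac{1}{x_n}\right)(y_1, \ldots, y_n)' \right\} \]
\[ \phantom{\hat\beta} = \frac{\sigma^2}{x_{sT}} \cdot \frac{1}{\sigma^2}\sum_{i=1}^{n} y_i = \frac{y_{sT}}{x_{sT}} = \frac{\bar{y}_s}{\bar{x}_s}. \]

Next,
\[ (y_r \mid y_s) \sim N_m(E_*, V_*), \]
where:

$m = N - n$
\[ E_* = X_r\hat\beta + \Sigma_{rs}\Sigma_{ss}^{-1}(y_s - X_s\hat\beta) = (x_{n+1}, \ldots, x_N)'\hat\beta + 0 = (x_{n+1}, \ldots, x_N)'\,\frac{\bar{y}_s}{\bar{x}_s} \]
\[ V_* = \Sigma_{rr} - \Sigma_{rs}\Sigma_{ss}^{-1}\Sigma_{sr} + (X_r - \Sigma_{rs}\Sigma_{ss}^{-1}X_s)D(X_r - \Sigma_{rs}\Sigma_{ss}^{-1}X_s)' \]
\[ \phantom{V_*} = \sigma^2\mathrm{diag}(x_{n+1}, \ldots, x_N) - 0 + \{(x_{n+1}, \ldots, x_N)' - 0\}\,\frac{\sigma^2}{x_{sT}}\,\{(x_{n+1}, \ldots, x_N) - 0\} \]
\[ \phantom{V_*} = \sigma^2\left\{ \mathrm{diag}(x_{n+1}, \ldots, x_N) + \frac{1}{x_{sT}}(x_{n+1}, \ldots, x_N)'(x_{n+1}, \ldots, x_N) \right\}. \]

Thus finally we have that
\[ (\bar{y} \mid y_s) \sim N(e_*, v_*), \]
where:
\[ e_* = \frac{y_{sT} + 1_m'E_*}{N} = \frac{1}{N}\left( y_{sT} + x_{rT}\frac{\bar{y}_s}{\bar{x}_s} \right) = \frac{y_{sT}}{x_{sT}} \cdot \frac{x_{sT} + x_{rT}}{N} = \frac{\bar{y}_s}{\bar{x}_s}\,\bar{x} \]
\[ v_* = \frac{1_m'V_*1_m}{N^2} = \frac{\sigma^2}{N^2}\,1_m'\left\{ \mathrm{diag}(x_{n+1}, \ldots, x_N) + \frac{1}{x_{sT}}(x_{n+1}, \ldots, x_N)'(x_{n+1}, \ldots, x_N) \right\}1_m \]
\[ \phantom{v_*} = \frac{\sigma^2}{N^2}\left\{ \sum_{i=n+1}^{N} x_i + \frac{1}{x_{sT}}\left( \sum_{i=n+1}^{N} x_i \right)^2 \right\} = \frac{\sigma^2}{N^2}\left( x_{rT} + \frac{x_{rT}^2}{x_{sT}} \right) \]
\[ \phantom{v_*} = \sigma^2\,\frac{1}{N}\,\frac{x_{rT}}{x_{sT}} \cdot \frac{x_{sT} + x_{rT}}{N} = \sigma^2\,\frac{(N-n)\bar{x}_r}{N n\bar{x}_s}\,\bar{x} = \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)\frac{\bar{x}_r\,\bar{x}}{\bar{x}_s}. \]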
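The matrix steps of this solution can be traced numerically. The R sketch below applies the general formulae for D, the posterior mean of beta, E_* and V_* with the flat prior (so that the prior precision is zero) and Sigma = sigma^2 diag(x), then compares the resulting predictive mean and variance with the closed forms. The covariate values, sample values and sigma are hypothetical numbers invented for the check.

```r
# Hypothetical data: three sampled and four nonsampled units
xs <- c(2, 5, 3); ys <- c(4.1, 9.8, 6.4); xr <- c(1, 4, 6, 2); sig <- 1.5
n <- length(xs); m <- length(xr); N <- n + m
Xs <- matrix(xs); Xr <- matrix(xr)
Sss <- sig^2 * diag(xs); Srr <- sig^2 * diag(xr)   # Sigma_rs = 0 by independence

D    <- solve(t(Xs) %*% solve(Sss, Xs))            # prior precision is 0 (flat prior)
bhat <- D %*% t(Xs) %*% solve(Sss, ys)
Es   <- Xr %*% bhat                                # E_* (second term vanishes)
Vs   <- Srr + Xr %*% D %*% t(Xr)                   # V_*

e <- (sum(ys) + sum(Es)) / N                       # predictive mean of ybar
v <- sum(Vs) / N^2                                 # predictive variance of ybar
xbar <- mean(c(xs, xr))
c(e, mean(ys) / mean(xs) * xbar)
c(v, sig^2 / n * (1 - n / N) * mean(xr) * xbar / mean(xs))
```

Each printed pair should contain two equal numbers if the algebra above is right.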


Exercise 10.3 Practice with the general normal-normal finite population model

Consider a superpopulation model in which all values are independent
and normally distributed with mean $\mu$, and where each value $y_i$ has a
variance which is either:

$\sigma_0^2$ if the corresponding covariate value $x_i$ is 0, or

$\sigma_1^2$ if $x_i = 1$ (the only other possibility).

Suppose that $\sigma_0$, $\sigma_1$ and all $N$ covariate values $x_i$ are given. Also
suppose there is a priori ignorance regarding $\mu$.

Find a simple expression for the predictive distribution of the finite
population mean $\bar{y}$. Then calculate the predictive mean and 95%
predictive interval for $\bar{y}$ if:

$\sigma_0 = 0.08$, $\sigma_1 = 1.2$, $y_s = (2.1, 4.9, 2.3, 2.0, 0.2)'$

$x = (0,1,0,0,1,\; 1,1,1,0,0,\; 1,1,1,1,0,\; 0,1)'$.

Note: We have here defined a type of stratification; the finite
population is assumed to consist of two strata with different variances
but the same underlying mean in both strata.


Solution to Exercise 10.3 


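Let $n_0$ denote the number of covariate values $x_i$ in the sample (of size $n$)
which are 0, and let $n_1$ be the number which are 1. Likewise, let $m_0$
denote the number of covariate values $x_i$ in the nonsample (of size
$m = N - n$) which are 0, and let $m_1$ be the number which are 1.

(Thus, $n_1 = \sum_{i=1}^{n} x_i$, $n_0 = n - n_1$, $m_1 = \sum_{i=n+1}^{N} x_i$ and $m_0 = m - m_1$.)

Then, without loss of generality, re-order the finite population values in
such a way that $x_s = (0, \ldots, 0, 1, \ldots, 1)'$ and $x_r = (0, \ldots, 0, 1, \ldots, 1)'$.

(Thus, in each of the sample and nonsample vectors, place the values
with covariate 0 first, and place the values with covariate 1 last.)

With this setup, the Bayesian model is:
\[ (y \mid \beta) \sim N_N(X\beta, \Sigma) \]
\[ \beta \sim N_p(\delta, \Omega), \]
where:

$p = 1$, $\beta = \mu$, $\delta = 0$, $\Omega = \infty$

$X = 1_N$ (since the covariates do not affect the means)

$\Sigma = \mathrm{diag}(\sigma_0^2 1_{n_0}', \sigma_1^2 1_{n_1}', \sigma_0^2 1_{m_0}', \sigma_1^2 1_{m_1}')$
(a matrix with zeros everywhere except for
$\sigma_0^2, \ldots, \sigma_0^2, \sigma_1^2, \ldots, \sigma_1^2, \sigma_0^2, \ldots, \sigma_0^2, \sigma_1^2, \ldots, \sigma_1^2$
along the main diagonal).

Then
\[ (\beta \mid y_s) \sim N_1(\hat\beta, D), \]
where:
\[ D = (\Omega^{-1} + X_s'\Sigma_{ss}^{-1}X_s)^{-1} = (0 + 1_n'\Sigma_{ss}^{-1}1_n)^{-1} = \left\{ 1_n'\,\mathrm{diag}(\sigma_0^{-2}1_{n_0}', \sigma_1^{-2}1_{n_1}')\,1_n \right\}^{-1} \]
\[ \phantom{D} = (n_0\sigma_0^{-2} + n_1\sigma_1^{-2})^{-1} = \frac{1}{n_0\sigma_0^{-2} + n_1\sigma_1^{-2}} \]
\[ \hat\beta = D(\Omega^{-1}\delta + X_s'\Sigma_{ss}^{-1}y_s) = D(0 + 1_n'\Sigma_{ss}^{-1}y_s) = D\,1_n'\,\mathrm{diag}(\sigma_0^{-2}1_{n_0}', \sigma_1^{-2}1_{n_1}')\,y_s \]
\[ \phantom{\hat\beta} = D(\sigma_0^{-2}y_{s0T} + \sigma_1^{-2}y_{s1T}). \]

Note: Here,
\[ y_{s0T} = \sum_{i=1}^{n_0} y_i \]
denotes the total of the sample values with covariate $x_i = 0$, and
\[ y_{s1T} = \sum_{i=n_0+1}^{n} y_i \]
denotes the total of the sample values with covariate $x_i = 1$.

Next, $(y_r \mid y_s) \sim N_m(E_*, V_*)$, where $m = N - n$ and:
\[ E_* = X_r\hat\beta + \Sigma_{rs}\Sigma_{ss}^{-1}(y_s - X_s\hat\beta) = 1_m\hat\beta + 0 = 1_m\hat\beta \]
\[ V_* = \Sigma_{rr} - \Sigma_{rs}\Sigma_{ss}^{-1}\Sigma_{sr} + (X_r - \Sigma_{rs}\Sigma_{ss}^{-1}X_s)D(X_r - \Sigma_{rs}\Sigma_{ss}^{-1}X_s)' \]
\[ \phantom{V_*} = \Sigma_{rr} - 0 + (1_m - 0)D(1_m - 0)' = \mathrm{diag}(\sigma_0^2 1_{m_0}', \sigma_1^2 1_{m_1}') + D\,1_m1_m'. \]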
Thus $(\bar{y} \mid y_s) \sim N(e_*, v_*)$, where:
\[ e_* = \frac{y_{sT} + 1_m'E_*}{N} = \frac{1}{N}(y_{sT} + 1_m'1_m\hat\beta) = \frac{y_{sT} + m\hat\beta}{N} \]
\[ v_* = \frac{1_m'V_*1_m}{N^2} = \frac{1}{N^2}\,1_m'\left\{ \mathrm{diag}(\sigma_0^2 1_{m_0}', \sigma_1^2 1_{m_1}') + D\,1_m1_m' \right\}1_m = \frac{1}{N^2}(m_0\sigma_0^2 + m_1\sigma_1^2 + Dm^2). \]

In summary, we have that $(\bar{y} \mid y_s) \sim N(e_*, v_*)$, where:
\[ e_* = \frac{y_{sT} + m\hat\beta}{N}, \quad m = N - n, \quad \hat\beta = D(\sigma_0^{-2}y_{s0T} + \sigma_1^{-2}y_{s1T}) \]
\[ D = \frac{1}{n_0\sigma_0^{-2} + n_1\sigma_1^{-2}}, \quad v_* = \frac{m_0\sigma_0^2 + m_1\sigma_1^2 + m^2D}{N^2}. \]

Numerically, we are given:

$\sigma_0 = 0.08$, $\sigma_1 = 1.2$, $y_s = (2.1, 4.9, 2.3, 2.0, 0.2)'$ (thus $n = 5$)

$x = (0,1,0,0,1,\; 1,1,1,0,0,\; 1,1,1,1,0,\; 0,1)'$
(thus $m = 12$ and $N = n + m = 17$)

$x_s = (0,1,0,0,1)'$, $x_r = (1,1,1,0,0,\; 1,1,1,1,0,\; 0,1)'$.

We now re-order the sample and nonsample values appropriately and so
redefine:

$y_s = (2.1, 2.0, 2.3, 4.9, 0.2)'$

$x_s = (0,0,0,1,1)'$

$x_r = (0,0,0,0,1,\; 1,1,1,1,1,\; 1,1)'$.

Note: We have merely swapped units 2 and 4 in both $y_s$ and $x_s$,
respectively, so that all units with covariate 0 appear first and all units
with covariate 1 appear last. We have also written the nonsample
covariate vector $x_r$ with all four zero values listed at the beginning.

We see that:

$n_0 = 3$, $n_1 = 2$, $m_0 = 4$, $m_1 = 8$

$y_{s0T} = 2.1 + 2.0 + 2.3 = 6.4$, $y_{s1T} = 4.9 + 0.2 = 5.1$,

$y_{sT} = 6.4 + 5.1 = 11.5$

$\bar{y}_{s0} = 6.4/3 = 2.1333$, $\bar{y}_{s1} = 5.1/2 = 2.55$,

$\bar{y}_s = 11.5/5 = 2.3$.

Thereby we obtain $(\bar{y} \mid y_s) \sim N(e_*, v_*)$, where:
\[ D = \frac{1}{n_0\sigma_0^{-2} + n_1\sigma_1^{-2}} = \frac{1}{3/0.08^2 + 2/1.2^2} = 0.0021270 \]
\[ \hat\beta = D(\sigma_0^{-2}y_{s0T} + \sigma_1^{-2}y_{s1T}) = 0.0021270\,(6.4/0.08^2 + 5.1/1.2^2) = 2.1345 \]
\[ e_* = \frac{y_{sT} + m\hat\beta}{N} = \frac{11.5 + 12 \times 2.1345}{17} = 2.1832 \]
\[ v_* = \frac{m_0\sigma_0^2 + m_1\sigma_1^2 + m^2D}{N^2} = \frac{4 \times 0.08^2 + 8 \times 1.2^2 + 12^2 \times 0.0021270}{17^2} = 0.041010. \]

Thus the predictive mean of the finite population mean $\bar{y}$ is $e_* = 2.18$,
and the 95% predictive interval for $\bar{y}$ is $(e_* \pm 1.96\sqrt{v_*}) = (1.79, 2.58)$.

R Code for Exercise 10.3

options(digits=4)
sig0=0.08; sig1=1.2; ys=c(2.1,2.0,2.3,4.9,0.2); n=length(ys)
xs=c(0,0,0,1,1); xr=c(0,0,0,0,1, 1,1,1,1,1, 1,1); m=length(xr); N=n+m
n1=sum(xs); n0=n-n1; m1=sum(xr); m0=m-m1
c(n,n0,n1, m,m0,m1, N) # 5 3 2 12 4 8 17
ysT=sum(ys); ys1T=sum(ys*xs); ys0T=ysT-ys1T
ysbar=ysT/n; ys1bar=ys1T/n1; ys0bar=ys0T/n0
c(ys0T,ys1T,ysT, ys0bar,ys1bar,ysbar)
# 6.400 5.100 11.500 2.133 2.550 2.300

D=1/(n0/sig0^2+n1/sig1^2); betahat=D*(ys0T/sig0^2+ys1T/sig1^2)
estar=(1/N)*(ysT+m*betahat)
# predictive variance: nonsample noise plus posterior uncertainty about beta
vstar=(1/N^2)*(m0*sig0^2+m1*sig1^2+D*m^2)
c(D,betahat,estar,vstar) # 0.002127 2.134564 2.183222 0.041010
hpdr=estar+c(-1,1)*qnorm(0.975)*sqrt(vstar); c(hpdr) # 1.786 2.580


10.7 The normal-normal-gamma finite 
population model 


For the models so far considered in this chapter, the superpopulation
variance parameter $\sigma^2$ or variance-covariance matrix parameter $\Sigma$ has
been assumed to be known.

If this parameter were unknown, as might typically be the case in
practice, then an estimate could be computed from the data via some
method (which need not necessarily be Bayesian) and substituted into
the equations derived.

This strategy, which may be considered an example of empirical Bayes
techniques, may sometimes work well, especially if based on a
sufficiently large sample size.

For example, recall that in the case of no covariates and very weak prior
information, with the superpopulation variance $\sigma^2$ known, the $1-\alpha$ CPDR
for $\bar{y}$ is
\[ \left( \bar{y}_s \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\sqrt{1 - \frac{n}{N}} \right). \]

Now suppose that $n$ is large and we estimate $\sigma^2$ by the sample variance,
\[ s_s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y}_s)^2. \]

Then the result is the same as the classical design-based CI one would
use in the same situation of a large sample size.

However, this strategy will not work well generally. For example, if $n$ is
small then it will lead to an interval which has a frequentist coverage
well below the intended level of $1-\alpha$. In such cases, the problem could
be addressed to some extent by applying an adjustment which reflects
uncertainty regarding the unknown variance parameter. However, the
nature of this type of adjustment would be ad hoc and could lead to
other problems with the inference.
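The undercoverage of the plug-in interval for small n can be illustrated by simulation. The following R sketch is only indicative: the true superpopulation mean and standard deviation (10 and 2), the sizes n = 3 and N = 7, the number of replications, and the convention of treating the first n generated units as the sample are all assumptions made for the illustration.

```r
set.seed(123)
n <- 3; N <- 7; mu <- 10; sig <- 2; reps <- 20000   # assumed settings
covered <- replicate(reps, {
  y  <- rnorm(N, mu, sig)        # finite population drawn from the superpopulation
  ys <- y[1:n]                   # sample (first n units, by exchangeability)
  # plug-in half-width: sigma replaced by the sample standard deviation
  half <- qnorm(0.975) * sd(ys) / sqrt(n) * sqrt(1 - n / N)
  abs(mean(y) - mean(ys)) <= half
})
coverage <- mean(covered)
coverage                         # empirical coverage of the nominal 95% interval
2 * pt(qnorm(0.975), n - 1) - 1  # coverage implied by a t pivot with n-1 df
```

Under these assumptions the empirical coverage should fall clearly short of 95% and lie close to the value implied by the t distribution with n - 1 degrees of freedom, which is one way to see why the variance parameter is better treated as unknown within the model.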


Perhaps the best way to deal with uncertainty regarding the variance
parameter is to incorporate it into the finite population model as yet
another random variable with its own prior distribution, i.e. to add
another level to the hierarchical structure of that model. This is the
approach we will now take. Note that parts of the exposition below will
be a review of material already covered in previous chapters.

With the above in mind, and with quantities as defined previously, we
define the normal-normal-gamma finite population model as follows:
\[ (y \mid \beta, \lambda) \sim N_N(X\beta, \Sigma/\lambda) \]
\[ (\beta \mid \lambda) \sim N_p(\delta, \Omega) \]
\[ \lambda \sim G(\eta, \tau). \]

A problem with this model is that it involves an additional nuisance
parameter to deal with relative to the normal-normal finite population
model, namely $\lambda$. This means that the predictive pdf of the nonsample
vector cannot be obtained so easily.

That density is now
\[ f(y_r \mid y_s) = \iint f(y_r, \beta, \lambda \mid y_s)\, d\beta\, d\lambda \propto \iint f(y, \beta, \lambda)\, d\beta\, d\lambda, \qquad (10.7) \]
where
\[ f(y, \beta, \lambda) = f(\lambda) f(\beta \mid \lambda) f(y \mid \beta, \lambda) \]
\[ \propto \lambda^{\eta - 1}e^{-\tau\lambda} \times \exp\left\{ -\tfrac{1}{2}(\beta - \delta)'\Omega^{-1}(\beta - \delta) \right\} \times \lambda^{N/2}\exp\left\{ -\tfrac{\lambda}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta) \right\} \]
is the joint density of all random variables involved in the model,
namely the $N$ finite population values, $y_1, \ldots, y_N$, and the $p + 1$ model
parameters, namely $\lambda, \beta_1, \ldots, \beta_p$.

In an attempt to perform the second double integral at (10.7) (which is
actually a $(p+1)$-fold integral), we may first integrate with respect to $\lambda$
and obtain
\[ f(y_r \mid y_s) \propto \int \frac{\exp\left\{ -\tfrac{1}{2}(\beta - \delta)'\Omega^{-1}(\beta - \delta) \right\}}{\left[ \tau + \tfrac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta) \right]^{\eta + N/2}}\, d\beta \]
(after recognising a gamma density in $\lambda$), or first integrate with respect
to $\beta$ and obtain
\[ f(y_r \mid y_s) \propto \int_0^\infty \lambda^{\eta + N/2 - 1}\exp\left\{ -\lambda\left( \tau + \tfrac{1}{2}y'\Sigma^{-1}y \right) \right\} \left| \Omega^{-1} + \lambda X'\Sigma^{-1}X \right|^{-1/2} \]
\[ \times \exp\left\{ \tfrac{1}{2}(\Omega^{-1}\delta + \lambda X'\Sigma^{-1}y)'(\Omega^{-1} + \lambda X'\Sigma^{-1}X)^{-1}(\Omega^{-1}\delta + \lambda X'\Sigma^{-1}y) \right\} d\lambda \]
(after recognising a multivariate normal density in $\beta$).

Either way, the remaining integral is in general impossible to perform
analytically, and the posterior predictive distributions of the nonsample
vector and linear combinations of that vector (such as the finite
population mean and total) are not normally distributed. However, there
is an important special case which simplifies matters considerably.

10.8 Special cases of the normal-normal-gamma finite population model


Theorem 10.1: Suppose there is a priori ignorance regarding $\beta$ and it is
appropriate to set $\delta = 0$ and $\Omega = \infty$, so that
\[ f(\beta \mid \lambda) \propto f(\beta) \propto 1, \quad \beta \in \mathbb{R}^p. \]

Then the predictive distribution of the finite population mean is given by
\[ \left( \frac{\bar{y} - a}{b} \,\Big|\, y_s \right) \sim t(2\eta + n - p), \]
where:
\[ a = \frac{y_{sT} + 1_m'\{X_r\hat\beta + \Sigma_{rs}\Sigma_{ss}^{-1}(y_s - X_s\hat\beta)\}}{N} \]
\[ b^2 = \frac{1_m'(\Sigma_{rr} - \Sigma_{rs}\Sigma_{ss}^{-1}\Sigma_{sr} + \Delta D\Delta')1_m\,[2\tau + (y_s - X_s\hat\beta)'\Sigma_{ss}^{-1}(y_s - X_s\hat\beta)]}{(2\eta + n - p)N^2} \]
\[ \hat\beta = DX_s'\Sigma_{ss}^{-1}y_s, \quad D = (X_s'\Sigma_{ss}^{-1}X_s)^{-1}, \quad \Delta = X_r - \Sigma_{rs}\Sigma_{ss}^{-1}X_s. \]

Note: Here, $\hat\beta$ is the MLE of $\beta$, and also the posterior mean of $\beta$
under the simpler normal-normal finite population model with
improper prior $f(\beta) \propto 1$, $\beta \in \mathbb{R}^p$ (and $\sigma^2$ known).

Theorem 10.1 can be proved by first noting that:

(a) $(\lambda \mid y_s)$ is gamma (with parameters that can be obtained by
integrating $f(\beta, \lambda \mid y_s)$ with respect to $\beta$), and

(b) $(\bar{y} \mid y_s, \lambda)$ is normal (with parameters that can be obtained by
examining the normal-normal finite population model above).

Using these two distributions, one can solve for the predictive density of
the finite population mean via the identity
\[ f(\bar{y} \mid y_s) = \int f(\bar{y}, \lambda \mid y_s)\, d\lambda = \int f(\bar{y} \mid y_s, \lambda) f(\lambda \mid y_s)\, d\lambda. \]

A special case of Theorem 10.1 which assumes a priori ignorance of $\lambda$
by way of setting $\eta = \tau = 0$ can be found in Royall and Pfeffermann
(1982).

If we further assume conditional independence (which may be expressed by
writing $\Sigma = I_N$) and no auxiliary information ($p = 1$ and $X = 1_N$), the
result in Theorem 10.1 reduces to
\[ \left( \frac{\bar{y} - \bar{y}_s}{(s_s/\sqrt{n})\sqrt{1 - n/N}} \,\Big|\, y_s \right) \sim t(n - 1), \]
where $s_s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y}_s)^2$ (the sample variance).

This result was already proved in a previous chapter without the
involvement of vectors and matrices. Again note that the result leads to
point estimates and interval estimates which are identical to those which
one might construct using a design-based approach (see Cochran, 1977,
Section 2.8).
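The reduction of Theorem 10.1 to the familiar t-based interval can be checked numerically. The R sketch below evaluates the general expressions for a and b^2 with the identity covariance matrix, a single column of ones as the design matrix, and both gamma prior parameters set to zero, and compares them with the sample mean and with (s^2/n)(1 - n/N). The three sample values and N = 7 are borrowed from Exercise 10.1 purely as convenient test numbers; the variable names are hypothetical.

```r
ys <- c(5.7, 9.6, 8.3); n <- length(ys); N <- 7; m <- N - n
eta <- 0; tau <- 0; p <- 1                       # diffuse gamma prior, no covariates
Xs <- matrix(1, n, 1); Xr <- matrix(1, m, 1)
Sss <- diag(n); Srr <- diag(m); Srs <- matrix(0, m, n)   # Sigma = I_N

D    <- solve(t(Xs) %*% solve(Sss, Xs))
bhat <- D %*% t(Xs) %*% solve(Sss, ys)
Del  <- Xr - Srs %*% solve(Sss, Xs)
res  <- ys - Xs %*% bhat                         # residual vector y_s - X_s * bhat
a  <- (sum(ys) + sum(Xr %*% bhat + Srs %*% solve(Sss, res))) / N
R  <- drop(t(res) %*% solve(Sss, res))
b2 <- sum(Srr - Srs %*% solve(Sss, t(Srs)) + Del %*% D %*% t(Del)) *
      (2 * tau + R) / ((2 * eta + n - p) * N^2)
df <- 2 * eta + n - p

c(a, mean(ys))                                   # should match
c(b2, var(ys) / n * (1 - n / N))                 # should match
a + c(-1, 1) * qt(0.975, df) * sqrt(b2)          # 95% predictive interval for ybar
```

Summing all entries of the m by m matrix is equivalent to pre- and post-multiplying by the vector of ones, which keeps the code close to the formula for b^2.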


Exercise 10.4 Proof of Theorem I 


Prove Theorem 10.1 above. 


Solution to Exercise 10.4 


Using the procedure outlined above, we first derive the unconditional 
pdf of A as follows: 


T = A 
fy | fBAly)dB « fae ‘etx? exp -Za Jap. 
where 
Q - (y, - X,B) X; (y, - X,B) 
= yE y,- y, XIX. p= B XEY, tp XXIX, p. 


Now equate Q, with 


-(B-T)M(B-T)-*R (where R stands for ‘remainder’) 
= B'M B- B'MT -TM B«TMT +R. 


We see that 
M 2 XXX, X, and MT = XIX, y,, 
so that 
T-M'(MT)-(XIX,X,)'XXjy,. 
Note: Here, $T$ is the same as $\hat\beta$ in Theorem 10.1.
Also, $R = y_s'\Sigma_{ss}^{-1}y_s - T'MT = (y_s - X_sT)'\Sigma_{ss}^{-1}(y_s - X_sT)$.

Note: This is easily proved by noting that the RHS here is
$$y_s'\Sigma_{ss}^{-1}y_s - y_s'\Sigma_{ss}^{-1}X_sT - T'X_s'\Sigma_{ss}^{-1}y_s + T'X_s'\Sigma_{ss}^{-1}X_sT,$$
where
$$y_s'\Sigma_{ss}^{-1}X_sT = T'X_s'\Sigma_{ss}^{-1}y_s = T'MT$$
(since $y_s'\Sigma_{ss}^{-1}X_sT$ is a scalar quantity), and where
$$T'X_s'\Sigma_{ss}^{-1}X_sT = T'MT,$$
so that the RHS equals
$$y_s'\Sigma_{ss}^{-1}y_s - T'MT - T'MT + T'MT = y_s'\Sigma_{ss}^{-1}y_s - T'MT.$$


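Thus

$$f(\lambda \mid y_s) \propto \int \lambda^{\eta-1} e^{-\tau\lambda}\, \lambda^{n/2} \exp\left\{ -\frac{\lambda}{2}\left[ (\beta - T)'M(\beta - T) + R \right] \right\} d\beta$$
$$= \lambda^{\eta + n/2 - 1} \exp\left\{ -\lambda\left( \tau + \frac{R}{2} \right) \right\} \times I,$$

where

$$I = \int \exp\left\{ -\frac{1}{2}(\beta - T)'\left( \frac{M^{-1}}{\lambda} \right)^{-1}(\beta - T) \right\} d\beta = (2\pi)^{p/2}\left| \frac{M^{-1}}{\lambda} \right|^{1/2}$$

(using standard multivariate normal theory)

$$\propto \lambda^{-p/2} \quad \text{(since } M = X_s'\Sigma_{ss}^{-1}X_s \text{ is a } p \text{ by } p \text{ matrix)}.$$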


It follows that

$$f(\lambda \mid y_s) \propto \lambda^{A/2 - 1} \exp\left( -\frac{B}{2}\lambda \right),$$

where:
$$A = 2\eta + n - p, \quad B = 2\tau + R, \quad R = (y_s - X_sT)'\Sigma_{ss}^{-1}(y_s - X_sT).$$
We thereby arrive at the required distribution,
$$(\lambda \mid y_s) \sim G(A/2, B/2),$$
which may also be expressed by writing
$$(B\lambda \mid y_s) \sim G(A/2, 1/2) \sim \chi^2(A).$$
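
The last step uses two standard facts: a gamma variable multiplied by a constant has its rate divided by that constant, and G(A/2, 1/2) (shape and rate) is the chi-squared distribution with A degrees of freedom. The short R check below is only a numerical sketch of those facts; the values of `A`, `B` and `x` are arbitrary illustrations and do not come from the text.

```r
# Illustrative values only (not from the text)
A <- 4
B <- 3.2
x <- c(0.5, 1, 2.5, 7)

# cdf of G(A/2, 1/2) versus cdf of chi-squared(A)
g <- pgamma(x, shape = A / 2, rate = 1 / 2)
k <- pchisq(x, df = A)
max_abs_diff <- max(abs(g - k))

# If lambda ~ G(A/2, B/2) then P(B * lambda <= x) = P(lambda <= x / B)
s <- pgamma(x / B, shape = A / 2, rate = B / 2)
max_abs_diff_scaled <- max(abs(s - k))

c(max_abs_diff, max_abs_diff_scaled)   # both should be numerically zero
```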


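Having derived the posterior distribution of $\lambda$, we now observe that
$$(\bar{y} \mid y_s, \lambda) \sim N(e_0, v_\lambda),$$

where:
$$e_0 = \frac{1}{N}\left( y_{sT} + 1_r'E_0 \right), \quad E_0 = X_rT + \Sigma_{rs}\Sigma_{ss}^{-1}(y_s - X_sT)$$
$$v_\lambda = \frac{w_0}{\lambda}, \quad w_0 = \frac{1}{N^2}\, 1_r'V_0 1_r$$
$$V_0 = G + A M^{-1} A'$$
$$G = \Sigma_{rr} - \Sigma_{rs}\Sigma_{ss}^{-1}\Sigma_{sr}, \quad A = X_r - \Sigma_{rs}\Sigma_{ss}^{-1}X_s.$$

Note: We have here simply applied the theory of the normal-normal
finite population model with $\Omega = \infty$ and with quantities such as $\Sigma_{ss}$
and $\Sigma_{rr}$ replaced by $\Sigma_{ss}/\lambda$ and $\Sigma_{rr}/\lambda$, etc.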


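Therefore

$$f(\bar{y} \mid y_s) = \int f(\bar{y} \mid y_s, \lambda)\, f(\lambda \mid y_s)\,d\lambda$$
$$\propto \int \lambda^{1/2} \exp\left\{ -\frac{\lambda}{2w_0}(\bar{y} - e_0)^2 \right\} \times \lambda^{A/2 - 1} \exp\left( -\frac{B}{2}\lambda \right) d\lambda$$
$$= \int \lambda^{\frac{A+1}{2} - 1} \exp\left\{ -\lambda\left[ \frac{B}{2} + \frac{(\bar{y} - e_0)^2}{2w_0} \right] \right\} d\lambda$$
$$\propto \left[ \frac{B}{2} + \frac{(\bar{y} - e_0)^2}{2w_0} \right]^{-\frac{A+1}{2}} \propto \left[ 1 + \frac{(\bar{y} - e_0)^2}{Bw_0} \right]^{-\frac{A+1}{2}} = \left[ 1 + \frac{1}{A}\left( \frac{\bar{y} - e_0}{h} \right)^2 \right]^{-\frac{A+1}{2}}, \quad h^2 = \frac{Bw_0}{A}.$$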
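It follows that

$$\left( \frac{\bar{y} - e_0}{h} \,\Big|\, y_s \right) \sim t(A), \quad \text{where } h^2 = \frac{Bw_0}{A}.$$

Here: $A = 2\eta + n - p$ (which is the same as the degrees of freedom in
the t distribution in Theorem 10.1),

$$e_0 = \frac{1}{N}\left( y_{sT} + 1_r'E_0 \right) = \frac{1}{N}\left[ y_{sT} + 1_r'\left\{ X_rT + \Sigma_{rs}\Sigma_{ss}^{-1}(y_s - X_sT) \right\} \right]$$

(which is the same as $a$ in Theorem 10.1), and

$$h^2 = \frac{B}{A}\, w_0 = \frac{2\tau + R}{2\eta + n - p} \times \frac{1_r'V_0 1_r}{N^2}$$
$$= \frac{2\tau + (y_s - X_sT)'\Sigma_{ss}^{-1}(y_s - X_sT)}{(2\eta + n - p)N^2}\; 1_r'(G + A M^{-1} A')1_r$$
$$= \frac{2\tau + (y_s - X_sT)'\Sigma_{ss}^{-1}(y_s - X_sT)}{(2\eta + n - p)N^2}\; 1_r'\left[ \Sigma_{rr} - \Sigma_{rs}\Sigma_{ss}^{-1}\Sigma_{sr} + (X_r - \Sigma_{rs}\Sigma_{ss}^{-1}X_s)(X_s'\Sigma_{ss}^{-1}X_s)^{-1}(X_r - \Sigma_{rs}\Sigma_{ss}^{-1}X_s)' \right] 1_r$$

(which is the same as $b^2$ in Theorem 10.1).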


That completes the proof of Theorem 10.1. 
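
The two-stage structure of the proof (a gamma posterior for the precision, then a normal for the finite population mean given the precision) can be checked by simulation. The sketch below assumes arbitrary illustrative values for `A`, `B`, `e0` and `w0` (they are not taken from any exercise) and compares empirical quantiles of the standardised draws with those of the t(A) distribution.

```r
set.seed(1)                                   # arbitrary seed
A <- 4; B <- 2.5; e0 <- 5; w0 <- 0.3          # hypothetical values
J <- 100000

lam  <- rgamma(J, shape = A / 2, rate = B / 2)  # (lambda | ys) ~ G(A/2, B/2)
ybar <- rnorm(J, e0, sqrt(w0 / lam))            # (ybar | ys, lambda) ~ N(e0, w0/lambda)

h <- sqrt(B * w0 / A)                           # scale from the proof: h^2 = B*w0/A
z <- (ybar - e0) / h

probs <- c(0.025, 0.5, 0.975)
mc_q  <- unname(quantile(z, probs))             # empirical quantiles of (ybar - e0)/h
t_q   <- qt(probs, df = A)                      # quantiles of t(A)
rbind(mc_q, t_q)
```

The two rows should agree to within Monte Carlo error, which is what the theorem asserts.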


10.9 The case of an informative prior on the 
regression parameter 


If there is some prior information available regarding the regression
parameter $\beta$ then $\Omega < \infty$ and Theorem 10.1 above cannot be applied.
So the problem of inference on the finite population mean $\bar{y}$ becomes
much more difficult.


However, that difficulty can be easily ‘sidestepped’ via Monte Carlo
methods based on a random sample from the predictive distribution of
$\bar{y}$, namely

$$\bar{y}^{(1)},\ldots,\bar{y}^{(J)} \sim \text{iid } f(\bar{y} \mid y_s).$$

With such a sample we can, for example, estimate $\bar{y}$'s predictive mean,
namely $\hat{\bar{y}} = E(\bar{y} \mid y_s)$, by the average of $\bar{y}^{(1)},\ldots,\bar{y}^{(J)}$, and estimate $\bar{y}$'s
95% CPDR by the empirical 0.025 and 0.975 quantiles of $\bar{y}^{(1)},\ldots,\bar{y}^{(J)}$.


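This then raises the question of how the Monte Carlo sample can be
obtained. In this context, we may employ the method of composition via
the equation

$$f(\bar{y}, \beta, \lambda \mid y_s) = f(\bar{y} \mid y_s, \beta, \lambda)\, f(\beta, \lambda \mid y_s).$$

Thus, we first generate a sample from the joint posterior distribution of the
two parameters,

$$(\beta^{(1)}, \lambda^{(1)}),\ldots,(\beta^{(J)}, \lambda^{(J)}) \sim \text{iid } f(\beta, \lambda \mid y_s),$$

and then for each $j = 1,\ldots,J$ we sample

$$\bar{y}^{(j)} \sim f(\bar{y} \mid y_s, \beta^{(j)}, \lambda^{(j)}) \sim N\left( \frac{1}{N}\left[ y_{sT} + 1_r'\left\{ X_r\beta^{(j)} + \Sigma_{rs}\Sigma_{ss}^{-1}(y_s - X_s\beta^{(j)}) \right\} \right],\; \frac{1}{N^2\lambda^{(j)}}\, 1_r'\left( \Sigma_{rr} - \Sigma_{rs}\Sigma_{ss}^{-1}\Sigma_{sr} \right)1_r \right).$$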


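This in turn raises the question of how to obtain the sample from
$f(\beta, \lambda \mid y_s)$. In this case an ideal solution is to apply a Gibbs sampler
defined by the following conditional distributions:

1. $(\beta \mid y_s, \lambda) \sim N_p(\hat\beta, D)$,
where: $\hat\beta = D(\Omega^{-1}\delta + \lambda X_s'\Sigma_{ss}^{-1}y_s)$
$D = (\Omega^{-1} + \lambda X_s'\Sigma_{ss}^{-1}X_s)^{-1}$.

2. $(\lambda \mid y_s, \beta) \sim G\left( \eta + \frac{n}{2},\; \tau + \frac{1}{2}(y_s - X_s\beta)'\Sigma_{ss}^{-1}(y_s - X_s\beta) \right)$.

Note: The first of these distributions derives directly from the normal-normal
finite population model with $\Sigma_{ss}$ and $\Sigma_{rr}$ replaced by $\Sigma_{ss}/\lambda$
and $\Sigma_{rr}/\lambda$, etc.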


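The second conditional is obtained by noting that

$$f(\lambda \mid y_s, \beta) \propto f(\lambda, \beta \mid y_s)$$
$$\propto f(\lambda, \beta, y_s)$$
$$= f(\lambda)\, f(\beta \mid \lambda)\, f(y_s \mid \lambda, \beta)$$
$$\propto \lambda^{\eta - 1} e^{-\tau\lambda} \times 1 \times \lambda^{n/2} \exp\left\{ -\frac{\lambda}{2}(y_s - X_s\beta)'\Sigma_{ss}^{-1}(y_s - X_s\beta) \right\}$$
$$= \lambda^{\eta + \frac{n}{2} - 1} \exp\left\{ -\lambda\left[ \tau + \frac{1}{2}(y_s - X_s\beta)'\Sigma_{ss}^{-1}(y_s - X_s\beta) \right] \right\}.$$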

Exercise 10.5 Practice with the normal-normal-gamma finite 
population model 


In the context of the normal-normal-gamma finite population model, 
suppose we obtain a sample of size n = 5, with values given by 


$y_s = (y_1,\ldots,y_n)' = (5.6, 2.3, 8.4, 5.1, 4.3)'$
via SRSWOR from a finite population of size N = 15. 


Find the predictive mean and 95% central predictive density region for
the finite population mean $\bar{y}$ in each of the following scenarios.


(a) There are no covariates, the population values are conditionally iid 
and there is no prior information available regarding the model 
parameters. 


(b) The population values are conditionally independent, the ith
population value has mean $\beta x_i$ and variance $x_i/\lambda$ ($i = 1,\ldots,N$), the
population covariate vector is
$x = (x_1,\ldots,x_N)' = (9.3, 4.6, 15.0, 11.2, 7.8, 2.4, 6.6, 3.0, 2.1, 7.3, 5.5, 8.0, 2.4, 4.2, 5.5)'$,
and there is no prior information regarding the model parameters.


(c) There are no covariates, the population values are conditionally iid, 
the prior on the normal mean is normal with mean 10 and variance 2.25, 
and (independently) the prior on the normal precision parameter (inverse 
of the normal variance) is gamma with mean 2 and variance 1/2 (or 
equivalently, gamma with parameters 8 and 4). 
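Solution to Exercise 10.5


(a) In this case, Theorem 10.1 reduces to

$$\left( \frac{\bar{y} - \bar{y}_s}{(s_s/\sqrt{n})\sqrt{1 - n/N}} \,\Big|\, y_s \right) \sim t(n-1),$$

where: $\bar{y}_s = \frac{1}{n}(y_1 + \ldots + y_n) = 5.140$

$s_s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y}_s)^2 = 4.9030.$

So the required predictive mean and 95% predictive interval of $\bar{y}$ are

$$\bar{y}_s = 5.140 \quad \text{and} \quad \left( \bar{y}_s \pm t_{0.025}(n-1)\frac{s_s}{\sqrt{n}}\sqrt{1 - \frac{n}{N}} \right) = (2.8951, 7.3849).$$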
n 


(b) In this case (a variation of Bayesian ratio estimation as discussed
earlier) we apply Theorem 10.1 with:

$$p = 1, \quad \eta = \tau = 0, \quad X = x = (x_1,\ldots,x_N)', \quad \Sigma = \text{diag}(x) = \text{diag}(x_1,\ldots,x_N).$$


Instead of deriving a ‘simple’ general algebraic expression for the 
predictive distribution of the finite population mean in this case, we can 
obtain the specific required result more quickly by directly applying the 
formulae in Theorem 10.1 using R. An advantage of this approach is 
that it leads us to write a general algorithm in R which can be 
straightaway used in other situations requiring Theorem 10.1. Also, the 
algorithm can be used to check our answer to part (a). 


Thereby we obtain the result that

$$\left( \frac{\bar{y} - a}{b} \,\Big|\, y_s \right) \sim t(c),$$

where $a = 3.3945$, $b = 0.1159$ and $c = 2\eta + n - p = 4$.


So the required predictive mean and 95% predictive interval of $\bar{y}$ are
$\hat{\bar{y}} = a = 3.3945$ and $(a \pm t_{0.025}(c)\,b) = (3.0725, 3.7164)$.
Note: This inference is lower than that in (a) because the mean of the
covariate values in the nonsample is 4.7, which is much lower than
their mean in the sample, 9.58. The regression coefficient $\beta$ in our
model is estimated as 0.5365, reflecting the positive linear relationship
between the x and y values in the sample.
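
Because the covariate enters both the mean and the variance in part (b), the matrix expressions of Theorem 10.1 collapse to simple sums: the estimate of the regression coefficient becomes the ratio of the sample totals of y and x. The sketch below reproduces the reported values that way, as a cross-check rather than a replacement for the general routine; the variable names are illustrative and the shortcut formulas are derived here under the part (b) assumptions (p = 1, eta = tau = 0, X = x, Sigma = diag(x)).

```r
# Data from Exercise 10.5(b); the shortcut formulas are a sketch, not the book's code
N  <- 15
ys <- c(5.6, 2.3, 8.4, 5.1, 4.3); n <- length(ys)
x  <- c(9.3, 4.6, 15.0, 11.2, 7.8, 2.4, 6.6, 3.0, 2.1, 7.3, 5.5, 8.0, 2.4, 4.2, 5.5)
xs <- x[1:n]; xr <- x[(n + 1):N]

betahat <- sum(ys) / sum(xs)                  # ratio estimate of beta
a  <- (sum(ys) + betahat * sum(xr)) / N       # predictive mean of the population mean
R  <- sum((ys - betahat * xs)^2 / xs)         # weighted residual sum of squares
cc <- n - 1                                   # c = 2*eta + n - p with eta = 0, p = 1
b  <- sqrt(R / (cc * N^2) * (sum(xr) + sum(xr)^2 / sum(xs)))

round(c(betahat, a, b), 4)                    # 0.5365 3.3945 0.1159
a + c(-1, 1) * qt(0.975, cc) * b              # 95% predictive interval
```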


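(c) In this case, a good option is to first employ the Gibbs sampler to
generate a random sample from the joint posterior distribution of $\beta$ and
$\lambda$, with:

$$p = 1, \quad \delta = 10, \quad \Omega = 9/4, \quad \eta = 8, \quad \tau = 4, \quad X = 1_N, \quad \Sigma = \text{diag}(1_N).$$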


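The two conditional distributions are:

1. $(\beta \mid y_s, \lambda) \sim N_p(\hat\beta, D)$,
where:
$\hat\beta = D(\Omega^{-1}\delta + \lambda X_s'\Sigma_{ss}^{-1}y_s)$
$D = (\Omega^{-1} + \lambda X_s'\Sigma_{ss}^{-1}X_s)^{-1}$

2. $(\lambda \mid y_s, \beta) \sim G\left( \eta + \frac{n}{2},\; \tau + \frac{1}{2}(y_s - X_s\beta)'\Sigma_{ss}^{-1}(y_s - X_s\beta) \right)$.

But, by analogy with the simpler normal-normal model and normal-gamma
model, these conditionals must be equivalent to:

1. $(\beta \mid y_s, \lambda) \sim N(\beta_\lambda, \sigma_\lambda^2)$,
where:
$\beta_\lambda = (1 - k_\lambda)\beta_0 + k_\lambda \bar{y}_s$
$\sigma_\lambda^2 = \frac{k_\lambda}{n\lambda}, \quad k_\lambda = \frac{n}{n + 1/(\lambda\sigma_0^2)}$
$\beta_0 = 10, \quad \sigma_0 = 3/2$

2. $(\lambda \mid y_s, \beta) \sim G\left( \eta + \frac{n}{2},\; \tau + \frac{n}{2}s_\beta^2 \right)$,
where
$s_\beta^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \beta)^2.$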
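
The equivalence of the two forms can be confirmed by direct substitution. The following sketch assumes the part (c) settings $p = 1$, $\delta = \beta_0$, $\Omega = \sigma_0^2$, $X_s = 1_n$ and $\Sigma_{ss} = I_n$, and writes $k_\lambda = n/(n + 1/(\lambda\sigma_0^2))$ for the credibility factor.

$$
\begin{aligned}
D &= \left( \frac{1}{\sigma_0^2} + n\lambda \right)^{-1} = \frac{1}{n\lambda}\cdot\frac{n}{n + 1/(\lambda\sigma_0^2)} = \frac{k_\lambda}{n\lambda}, \\
D\,\Omega^{-1} &= \frac{1/\sigma_0^2}{1/\sigma_0^2 + n\lambda} = 1 - k_\lambda, \qquad D\,n\lambda = k_\lambda, \\
D\left( \Omega^{-1}\delta + \lambda X_s'\Sigma_{ss}^{-1}y_s \right) &= D\left( \frac{\beta_0}{\sigma_0^2} + n\lambda\bar{y}_s \right) = (1 - k_\lambda)\beta_0 + k_\lambda\bar{y}_s, \\
\tau + \frac{1}{2}(y_s - 1_n\beta)'(y_s - 1_n\beta) &= \tau + \frac{1}{2}\sum_{i=1}^{n}(y_i - \beta)^2 = \tau + \frac{n}{2}\cdot\frac{1}{n}\sum_{i=1}^{n}(y_i - \beta)^2.
\end{aligned}
$$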
Either way, implementing this Gibbs sampler for 10,100 iterations with a
burn-in of 100 we obtain the trace plots and histograms for $\beta$ and $\lambda$ in
Figure 10.2. (The two subplots on the left are for $\beta$, and the two on the
right are for $\lambda$. The histograms do not include the first 100 iterations.)


Thinning the last 10,000 values of each parameter by a factor of 10 we
obtain an approximately random sample of size $J = 1{,}000$ from the joint
posterior distribution of the two parameters, namely

$$(\beta_j, \lambda_j) \sim \text{iid } f(\beta, \lambda \mid y_s), \quad j = 1,\ldots,J.$$

The sample ACFs over the entire sample of 10,000 and over the thinned
sample of 1,000 are shown for each of $\beta$ and $\lambda$ in Figure 10.3. (E.g. the
top-left subplot is for $\beta$ over the entire sample of 10,000.) The thinning
process has virtually eliminated all signs of autocorrelation.


Figure 10.2 Trace plots and histograms 
Figure 10.3 Sample ACFs
(Top two: J = 10,000; Bottom two: J = 1,000) 
Using our sample from the joint posterior of the two parameters we now
generate a sample from the predictive distribution of the nonsample
mean by drawing

$$\bar{y}_r^{(j)} \sim f(\bar{y}_r \mid y_s, \beta_j, \lambda_j) \sim N\left( \beta_j, \frac{1}{(N-n)\lambda_j} \right) \quad \text{for each } j = 1,\ldots,J.$$

Note: The result is
$$\bar{y}_r^{(1)},\ldots,\bar{y}_r^{(J)} \sim \text{iid } f(\bar{y}_r \mid y_s)$$
by virtue of the method of composition and the equation
$$f(\bar{y}_r, \beta, \lambda \mid y_s) = f(\bar{y}_r \mid y_s, \beta, \lambda)\, f(\beta, \lambda \mid y_s).$$

We next form a random sample from the predictive distribution of the
finite population mean by calculating

$$\bar{y}^{(j)} = \frac{1}{N}\left( n\bar{y}_s + (N-n)\bar{y}_r^{(j)} \right) \quad \text{for each } j = 1,\ldots,J.$$

Note: The result is $\bar{y}^{(1)},\ldots,\bar{y}^{(J)} \sim \text{iid } f(\bar{y} \mid y_s)$.
We now estimate $\bar{y}$ (and $\bar{y}$'s predictive mean, $\hat{\bar{y}} = E(\bar{y} \mid y_s)$) by

$$\bar{\bar{y}} = \frac{1}{J}\sum_{j=1}^{J}\bar{y}^{(j)} = 5.555,$$

with 95% CI for $\hat{\bar{y}}$ equal to

$$\left( \bar{\bar{y}} \pm 1.96\sqrt{\frac{1}{J(J-1)}\sum_{j=1}^{J}\left( \bar{y}^{(j)} - \bar{\bar{y}} \right)^2} \right) = (5.526, 5.584).$$

We also estimate the 95% CPDR for $\bar{y}$ by (4.685, 6.633), where the
bounds of this interval are the empirical 0.025 and 0.975 quantiles of
$\bar{y}^{(1)},\ldots,\bar{y}^{(J)}$.


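Another approach to performing Monte Carlo inference on $\bar{y}$ is via
Rao-Blackwell methods. This approach does not require the sample
$\bar{y}^{(1)},\ldots,\bar{y}^{(J)}$ and should provide more accurate Monte Carlo estimates.


The idea is based on the identities:

$$f(\bar{y} \mid y_s) = \int\!\!\int f(\bar{y}, \beta, \lambda \mid y_s)\,d\beta\,d\lambda = \int\!\!\int f(\bar{y} \mid y_s, \beta, \lambda)\, f(\beta, \lambda \mid y_s)\,d\beta\,d\lambda$$

$$\hat{\bar{y}} = E(\bar{y} \mid y_s) = E_{\beta,\lambda}\{ E(\bar{y} \mid y_s, \beta, \lambda) \mid y_s \}$$
$$f(\bar{y} \mid y_s) = E_{\beta,\lambda}\{ f(\bar{y} \mid y_s, \beta, \lambda) \mid y_s \}.$$

Now note once again that:

$$\bar{y} = \frac{1}{N}\left( n\bar{y}_s + (N-n)\bar{y}_r \right)$$
$$(\bar{y}_r \mid y_s, \beta, \lambda) \sim N\left( \beta, \frac{1}{(N-n)\lambda} \right).$$

So we now define:
$$e(\beta, \lambda) = E(\bar{y} \mid y_s, \beta, \lambda) = \frac{1}{N}\left( n\bar{y}_s + (N-n)E(\bar{y}_r \mid y_s, \beta, \lambda) \right) = \frac{1}{N}\left( n\bar{y}_s + (N-n)\beta \right)$$
$$v(\beta, \lambda) = V(\bar{y} \mid y_s, \beta, \lambda) = \frac{(N-n)^2}{N^2}V(\bar{y}_r \mid y_s, \beta, \lambda) = \frac{(N-n)^2}{N^2}\cdot\frac{1}{(N-n)\lambda} = \frac{N-n}{N^2\lambda}$$
$$e_j = e(\beta_j, \lambda_j) = \frac{1}{N}\left( n\bar{y}_s + (N-n)\beta_j \right)$$
$$v_j = v(\beta_j, \lambda_j) = \frac{N-n}{N^2\lambda_j}.$$

Note: Since $e(\beta, \lambda)$ does not depend on $\lambda$, we may also write $e(\beta, \lambda)$
as $e(\beta)$. Likewise, since $v(\beta, \lambda)$ does not depend on $\beta$, we may also
write $v(\beta, \lambda)$ as $v(\lambda)$.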


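Then the Rao-Blackwell estimate of $\bar{y}$ (and $\hat{\bar{y}} = E(\bar{y} \mid y_s)$) is

$$\bar{e} = \frac{1}{J}\sum_{j=1}^{J}e_j = 5.557,$$

with 95% CI for $\hat{\bar{y}}$ working out as

$$\left( \bar{e} \pm 1.96\sqrt{\frac{1}{J(J-1)}\sum_{j=1}^{J}(e_j - \bar{e})^2} \right) = (5.534, 5.581).$$

Note: The width of this Rao-Blackwell CI is 0.046 (0.04642 before the
bounds are rounded), which (as could be expected) is less than that of
the earlier CI, namely 5.584 - 5.526 = 0.058.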
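
The narrower interval is what the law of total variance predicts. As a brief sketch, writing $e(\beta) = E(\bar{y} \mid y_s, \beta, \lambda)$ and $v(\lambda) = V(\bar{y} \mid y_s, \beta, \lambda)$ for the conditional mean and variance of the finite population mean given both parameters:

$$
V(\bar{y} \mid y_s) = E\{ v(\lambda) \mid y_s \} + V\{ e(\beta) \mid y_s \} \;\ge\; V\{ e(\beta) \mid y_s \}.
$$

So the values $e_j$ vary less, on average, than the simulated values $\bar{y}^{(j)}$, and a CI of the form (sample mean) $\pm\, 1.96 \times$ (sample standard deviation)$/\sqrt{J}$ is correspondingly shorter when built from the $e_j$. This holds in expectation; for any particular Monte Carlo sample the ordering of the two widths is very likely but not guaranteed.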


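We can now also obtain the Rao-Blackwell estimate of the CPDR for $\bar{y}$.


First, the Rao-Blackwell estimate of the predictive density of $\bar{y}$ (that is,
of $f(\bar{y} \mid y_s)$) is

$$\bar{f}(\bar{y} \mid y_s) = \frac{1}{J}\sum_{j=1}^{J}f(\bar{y} \mid y_s, \beta_j, \lambda_j) = \frac{1}{J}\sum_{j=1}^{J}\frac{1}{\sqrt{2\pi v_j}}\exp\left\{ -\frac{(\bar{y} - e_j)^2}{2v_j} \right\}.$$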
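Note: The simplest and most ‘basic’ estimate of $f(\bar{y} \mid y_s)$ is the
‘histogram’ estimate, $\hat{f}(\bar{y} \mid y_s)$, obtained by smoothing a histogram of
the sampled values $\bar{y}^{(1)},\ldots,\bar{y}^{(J)} \sim \text{iid } f(\bar{y} \mid y_s)$.


The Rao-Blackwell estimate of the 95% CPDR of $\bar{y}$ is $(L, U)$, where $L$
and $U$ satisfy

$$\int_{-\infty}^{L}\frac{1}{J}\sum_{j=1}^{J}\frac{1}{\sqrt{2\pi v_j}}\exp\left\{ -\frac{(\bar{y} - e_j)^2}{2v_j} \right\}d\bar{y} = 0.025$$
$$\int_{-\infty}^{U}\frac{1}{J}\sum_{j=1}^{J}\frac{1}{\sqrt{2\pi v_j}}\exp\left\{ -\frac{(\bar{y} - e_j)^2}{2v_j} \right\}d\bar{y} = 0.975.$$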


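To obtain $L$ we rewrite the first of these two equations as
$$\frac{1}{J}\sum_{j=1}^{J}P(X_j < L) = 0.025,$$
where $X_j \sim N(e_j, v_j)$, or equivalently as

$$\frac{1}{J}\sum_{j=1}^{J}\Phi\left( \frac{L - e_j}{\sqrt{v_j}} \right) = 0.025 \quad \text{(where } \Phi \text{ is the standard normal cdf)}.$$

We can now solve this equation in a number of ways, for example by
minimising the function

$$\left( \frac{1}{J}\sum_{j=1}^{J}\Phi\left( \frac{L - e_j}{\sqrt{v_j}} \right) - 0.025 \right)^2$$

(whose minimum is 0 at the required $L$),
e.g. using the optim() function in R.


Likewise we can obtain $U$ by using optim() to minimise

$$\left( \frac{1}{J}\sum_{j=1}^{J}\Phi\left( \frac{U - e_j}{\sqrt{v_j}} \right) - 0.975 \right)^2$$

(whose minimum is 0 at the required $U$).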


Note: We could also obtain L and U using trial and error or the 
Newton-Raphson algorithm. 
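
As a further alternative, the averaged normal cdf is continuous and increasing in its argument, so a bracketing root-finder such as R's `uniroot()` can be applied to the equation directly instead of minimising a squared difference. The sketch below is illustrative only: the vectors `evec` and `vvec` are simulated stand-ins for the values e_1,...,e_J and v_1,...,v_J, not the output of the exercise.

```r
set.seed(7)                                    # arbitrary seed
J <- 1000
evec <- rnorm(J, 5.5, 0.4)                     # hypothetical e_j values
vvec <- 10 / (15^2 * rgamma(J, 8, 4))          # hypothetical v_j values

# Rao-Blackwell estimate of the predictive cdf: average of J normal cdfs
mixcdf <- function(q) mean(pnorm((q - evec) / sqrt(vvec)))

# Bracket wide enough that the cdf is essentially 0 and 1 at the end points
rng <- range(evec) + c(-10, 10) * sqrt(max(vvec))

L <- uniroot(function(q) mixcdf(q) - 0.025, interval = rng, tol = 1e-9)$root
U <- uniroot(function(q) mixcdf(q) - 0.975, interval = rng, tol = 1e-9)$root
c(L, U, mixcdf(L), mixcdf(U))
```

A root-finder avoids the need for a starting value and does not trigger the one-dimensional optimisation warning that `optim()` gives with its default method.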
Implementing the above procedure we arrive at the required Rao-
Blackwell estimate of the central predictive region for the finite 
population mean: (L,U) = (4.707, 6.542). 


Note: This is similar to the previous ‘histogram’ estimate of the 
CPDR, (4.685, 6.633). 


Figure 10.4 shows a histogram of the $J = 1{,}000$ simulated values
$\bar{y}^{(1)},\ldots,\bar{y}^{(J)} \sim \text{iid } f(\bar{y} \mid y_s)$, together with the histogram estimate $\bar{\bar{y}}$ and
the Rao-Blackwell estimate $\bar{e}$ of $\hat{\bar{y}} = E(\bar{y} \mid y_s)$. Also shown are the two
corresponding 95% CIs for $\hat{\bar{y}}$. The histogram is overlaid with the
histogram estimate $\hat{f}(\bar{y} \mid y_s)$ and the Rao-Blackwell estimate $\bar{f}(\bar{y} \mid y_s)$
of $f(\bar{y} \mid y_s)$. It will be observed that the Rao-Blackwell estimate
provides the smoother result.


Figure 10.4 Inferences on the finite population mean 
R Code for Exercise 10.5

# (a)
options(digits=4); N = 15; ys = c(5.6,2.3,8.4,5.1,4.3); n = length(ys)
est=mean(ys); ss2=var(ys); varybar=(ss2/n)*(1-n/N); tval=qt(0.975,n-1)
cpdr=est+c(-1,1)*tval*sqrt(varybar)
c(est,ss2,sqrt(ss2), varybar, sqrt(varybar), tval, cpdr)
# 5.1400 4.9030 2.2143 0.6537 0.8085 2.7764 2.8951 7.3849

# (b)
NNGFPM= function(eta=0, tau=0, alp=0.05,
   ys= c(5.6,2.3,8.4,5.1,4.3), X=rep(1,15) , N=15, sigma=diag(rep(1,N)) )
{
# This function performs inference under the normal-normal-gamma
# finite population model.
# Inputs: eta, tau, alp, ys, X, N, sigma
# Outputs: A list with $a, $b and $c indicating (ybar-a)/b given ys ~ t(c)
p=ncol(cbind(NA,X))-1; n=length(ys); c=2*eta+n-p
ysT=sum(ys); Xs=cbind(NA,X)[1:n,][,-1]; Xr=cbind(NA,X)[(n+1):N,][,-1]
sigmass=sigma[1:n,1:n]; sigmarr=sigma[(n+1):N,(n+1):N]
sigmasr=sigma[1:n,(n+1):N]; sigmars=t(sigmasr)
D=solve(t(Xs)%*%solve(sigmass)%*%Xs)
beta=D%*%t(Xs)%*%solve(sigmass)%*%ys
A=Xr-sigmars%*%solve(sigmass)%*%Xs; oner=rep(1,N-n)
a=(1/N)*( ysT + t(oner)%*%
   ( Xr%*%beta + sigmars%*%solve(sigmass)%*%(ys-Xs%*%beta) ) )
b2=(1/(c*N^2)) * ( 2*tau + t(ys-Xs%*%beta)%*%solve(sigmass)%*%
   (ys-Xs%*%beta) ) * t(oner)%*%
   ( sigmarr-sigmars%*%solve(sigmass)%*%sigmasr +
      A%*%D%*%t(A)) %*% oner
b=sqrt(b2); cpdr=a+c(-1,1)*qt(1-alp/2,c)*b
list(a=a,b=b,c=c,beta=beta, cpdr=cpdr)
}
# Test function by using it to check (a):
res= NNGFPM(); c(res$a,res$b,res$c,res$beta, res$cpdr)
# 5.1400 0.8085 4.0000 5.1400 2.8951 7.3849 Same as in (a) OK

# Apply function with covariate info:
xvec=c(9.3, 4.6, 15.0,11.2, 7.8, 2.4, 6.6, 3.0, 2.1, 7.3, 5.5, 8.0, 2.4, 4.2,
   5.5)
res= NNGFPM(X=xvec, sigma=diag(xvec))
c(res$a,res$b,res$c,res$beta,res$cpdr)
# 3.3945 0.1159 4.0000 0.5365 3.0725 3.7164

c(mean(xvec), mean(xvec[1:5]), mean(xvec[6:15]) ) # 6.327 9.580 4.700

# (c)
ys= c(5.6,2.3,8.4,5.1,4.3); ysbar=mean(ys); n = 5; N = 15; options(digits=4)

GIBBS = function(J=1000,ys= c(5.6,2.3,8.4,5.1,4.3),
   bet=1, lam=1, bet0=10, sig0=1.5, eta=8, tau=4)
{
betv=bet; lamv=lam; sig02=sig0^2; n=length(ys); ysbar=mean(ys);
for(j in 1:J){
   klam=n/(n+1/(lam*sig02)); sig2lam=klam/(n*lam)
   betlam=(1-klam)*bet0+klam*ysbar;
   bet=rnorm(1,betlam,sqrt(sig2lam))
   s2bet=mean((ys-bet)^2); lam=rgamma(1,eta+n/2,tau+n*s2bet/2)
   betv=c(betv,bet); lamv=c(lamv,lam) }
list(betv=betv, lamv=lamv)
}

set.seed(641); res=GIBBS(J=10100); X11(w=8,h=5.5); par(mfrow=c(2,2))
plot(res$betv,type="l"); plot(res$lamv,type="l")
hist(res$betv[-c(1:101)], prob=T, nclass=30);
hist(res$lamv[-c(1:101)], prob=T, nclass=30) # Fig. 10.2

betvec=res$betv[-c(1:101)][seq(10,10000,10)]; J = length(betvec); J # 1000
lamvec=res$lamv[-c(1:101)][seq(10,10000,10)]

acf(res$betv); acf(res$lamv); acf(betvec); acf(lamvec) # Fig. 10.3

betbar=mean(betvec); betci=betbar+c(-1,1)*qnorm(0.975)*sd(betvec)/sqrt(J)
c(betbar,betci) # 5.766 5.731 5.801
set.seed(121); yrbarvec=rnorm(J, betvec, 1/sqrt((N-n)*(lamvec)) )
yrbarbar=mean(yrbarvec);
yrbarci= yrbarbar+c(-1,1)*qnorm(0.975)*sd(yrbarvec)/sqrt(J)
yrbarcpdr=quantile(yrbarvec, c(0.025,0.975))
c(yrbarbar,yrbarci,yrbarcpdr) # 5.762 5.718 5.806 4.458 7.380

ybarvec=(1/N)*( n*ysbar + (N-n)*yrbarvec )
ybarbar=mean(ybarvec);
ybarci= ybarbar+c(-1,1)*qnorm(0.975)*sd(ybarvec)/sqrt(J)
ybarcpdr=quantile(ybarvec, c(0.025,0.975))
c(ybarbar,ybarci,ybarcpdr) # 5.555 5.526 5.584 4.685 6.633
ybarci[2]-ybarci[1] # 0.05849

evec=(1/N)*(n*ysbar + (N-n)*betvec); vvec=(N-n)/(N^2*lamvec)
ebar=mean(evec); eci=ebar+c(-1,1)*qnorm(0.975)*sd(evec)/sqrt(J)

Lfun=function(L){ ( 0.025-mean(pnorm( (L-evec)/sqrt(vvec) ) ) )^2 }
L = optim(par=3,fn=Lfun)$par; L # 4.707 (ignore warning message)
mean( pnorm( (L-evec)/sqrt(vvec) )) # 0.025 OK

Ufun=function(U){ ( 0.975-mean(pnorm( (U-evec)/sqrt(vvec) ) ) )^2 }
U = optim(par=7,fn=Ufun)$par; U # 6.542 (ignore warning message)
mean( pnorm( (U-evec)/sqrt(vvec) )) # 0.975 OK

ecpdr=c(L,U); c(ebar,eci,ecpdr) # 5.557 5.534 5.581 4.707 6.542
eci[2]-eci[1] # 0.04642

X11(w=8,h=7); par(mfrow=c(1,1))
hist(ybarvec, prob=T,nclass=20,xlim=c(3.5,8),
   xlab="ybar",ylab="density/relative frequency",main="")
lines(density(ybarvec), lty=2,lwd=3,col="blue")
abline(v=c(ybarbar,ybarci,ybarcpdr),lty=2,lwd=3,col="blue")

ybarv=seq(3,8,0.01); fv=rep(NA,length(ybarv))
for(i in 1:length(ybarv)) fv[i] = mean(dnorm(ybarv[i], evec, sqrt(vvec)))
lines(ybarv,fv, lty=1,lwd=2,col="red")
abline(v=c(ebar,eci,ecpdr),lty=1,lwd=2,col="red")

legend(3.4,0.9,c("Histogram","Rao-Blackwell"),
   lty=c(2,1), lwd=c(3,2),col=c("blue","red"), bg="white")




Transformations and Other Topics 


11.1 Inference on complicated quantities 


So far, in the context of Bayesian finite population models specified by:
$f(\xi \mid y, \theta)$ where $\xi$ is $s$ or $I$ or $L$ (as discussed earlier)
$f(y \mid \theta)$ where
$y = (y_s, y_r) = ((y_1,\ldots,y_n),(y_{n+1},\ldots,y_N)) = (y_1,\ldots,y_N)$
$f(\theta)$ where $\theta = (\theta_1,\ldots,\theta_q)$,
we have been focusing primarily on two finite population quantities, the
finite population total $y_T = y_1 + \ldots + y_N$ and the finite population mean

$$\bar{y} = (y_1 + \ldots + y_N)/N = y_T/N.$$

These are special cases of the class of linear combinations of the $N$
population values

$$\psi = c_1y_1 + \ldots + c_Ny_N,$$
for which inference is often straightforward, such as in the context of the
general normal-normal-gamma finite population model.


We will now consider other inferential targets. 


Generally, suppose we are interested in the quantity $\psi = g(\theta, y)$, where
$g$ is a potentially very complicated function of all $q$ model parameters
and all $N$ finite population values. In such cases, we may adopt the
following four-step strategy.

Step 1. Obtain a sample from the posterior distribution of $\theta = (\theta_1,\ldots,\theta_q)$,
that is $\theta^{(1)},\ldots,\theta^{(J)} \sim \text{iid } f(\theta \mid D)$, where $\theta^{(j)} = (\theta_1^{(j)},\ldots,\theta_q^{(j)})$ and where $D$
is the data, typically defined as $(s, y_s)$ or $(I, y_s)$ or $(L, y_s)$ as discussed
previously, and whichever the case may be.


Make use of special techniques if suitable, e.g. the method of
composition and MCMC methods like the Gibbs sampler.
Step 2. Use the sample in Step 1 to generate a random sample from the
predictive distribution of the nonsample vector $y_r = (y_{n+1},\ldots,y_N)$, that is
$y_r^{(1)},\ldots,y_r^{(J)} \sim \text{iid } f(y_r \mid D)$, where $y_r^{(j)} = (y_{n+1}^{(j)},\ldots,y_N^{(j)})$.

Make use of special techniques if required.


Often, the sample can be obtained easily via the method of composition
and the identity

$$f(y_r, \theta \mid D) = f(y_r \mid D, \theta)\, f(\theta \mid D),$$
namely by sampling

$$y_r^{(j)} \sim f(y_r \mid D, \theta^{(j)}) \quad \text{for each } j = 1,\ldots,J.$$

In many cases, each sampled nonsample vector $y_r^{(j)}$ here can be obtained
by sampling

$$y_i^{(j)} \sim \perp f(y_i \mid D, \theta^{(j)}), \quad i = n+1,\ldots,N,$$
and then forming the vector according to

$$y_r^{(j)} = (y_{n+1}^{(j)},\ldots,y_N^{(j)}).$$

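Step 3. Form the completed population vector

$$y^{(j)} = (y_s, y_r^{(j)}) = (y_1,\ldots,y_n, y_{n+1}^{(j)},\ldots,y_N^{(j)})$$

and then calculate
$$\psi^{(j)} = g(\theta^{(j)}, y^{(j)})$$
for each $j = 1,\ldots,J$.


The result will be a sample from the posterior/predictive distribution of
$\psi$, namely

$$\psi^{(1)},\ldots,\psi^{(J)} \sim \text{iid } f(\psi \mid D).$$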


Step 4. Use the sample obtained in Step 3 to perform Monte Carlo
inference on $\psi$ in the usual way. Thus, estimate the posterior/predictive
mean of $\psi$, namely

$$\hat\psi = E(\psi \mid D) = \int \psi f(\psi \mid D)\,d\psi$$
(which may be difficult to obtain analytically), by the Monte Carlo
sample mean $\bar\psi = \frac{1}{J}\sum_{j=1}^{J}\psi^{(j)}$ (which is unbiased, in that $E(\bar\psi \mid D) = \hat\psi$).

Also calculate the $1 - \alpha$ CI for $\hat\psi$ given by

$$\left( \bar\psi \pm z_{\alpha/2}\frac{s_\psi}{\sqrt{J}} \right), \quad \text{where } s_\psi^2 = \frac{1}{J-1}\sum_{j=1}^{J}(\psi^{(j)} - \bar\psi)^2.$$

Also, estimate the $1 - \alpha$ central posterior/predictive density region
(CPDR generally) for $\psi$ by $(Q_{\alpha/2}, Q_{1-\alpha/2})$, where $Q_p$ is the empirical p-
quantile of the sample $\psi^{(1)},\ldots,\psi^{(J)}$.


Also, estimate the entire posterior/predictive density of $\psi$, namely
$f(\psi \mid D)$, by $\hat{f}(\psi \mid D)$, a smooth of a histogram of $\psi^{(1)},\ldots,\psi^{(J)}$
(obtained by adjusting the smooth parameters).


Use Rao-Blackwell methods to improve precision, if possible and
practicable. For example, suppose that $q = 2$, $\theta = (\theta_1, \theta_2)$, $\psi = g(y, \theta_1)$,
and $f(\psi \mid D, \theta_2)$ has a simple form. Then, instead of using a ‘histogram
estimate’ $\hat{f}(\psi \mid D)$ to estimate $f(\psi \mid D)$, use the Rao-Blackwell estimate

$$\bar{f}(\psi \mid D) = \frac{1}{J}\sum_{j=1}^{J}f(\psi \mid D, \theta_2^{(j)}).$$
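
The Step 4 summaries are the same whatever the target quantity is, so they can be wrapped in a small helper. The function below is a sketch only: the name `mc_summary` is hypothetical, and the sample passed to it here is a stand-in drawn from a gamma distribution (true mean 3) rather than the output of Steps 1 to 3.

```r
# Hypothetical helper (not from the book): Step 4 summaries from a Monte Carlo sample
mc_summary <- function(psi, alpha = 0.05) {
  J <- length(psi)
  psibar <- mean(psi)                                        # MC estimate of E(psi | D)
  ci <- psibar + c(-1, 1) * qnorm(1 - alpha / 2) * sd(psi) / sqrt(J)  # CI for E(psi | D)
  cpdr <- unname(quantile(psi, c(alpha / 2, 1 - alpha / 2))) # empirical CPDR
  list(mean = psibar, ci = ci, cpdr = cpdr)
}

set.seed(3)                                    # arbitrary seed
psi <- rgamma(5000, shape = 3, rate = 1)       # stand-in sample with true mean 3
out <- mc_summary(psi)
out
```

Note the distinction the helper preserves: the CI describes Monte Carlo uncertainty about the posterior mean and shrinks as J grows, whereas the CPDR describes posterior uncertainty about the quantity itself and does not.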


Exercise 11.1 Estimation of nonstandard target quantities


(a) Suppose that 2.1, 5.2, 3.0, 7.7 and 9.3 constitute a random sample 
from a normal finite population of size 20 whose mean and variance are 
unknown. We are interested in the finite population median. Estimate 
this quantity using a suitable Bayesian model. 


(b) Repeat (a) but for the quantity: 
average percentage increase between subsequent ordered 
population values greater than 4. 


(c) Repeat (a) but for the quantity: 


sum of finite population values in the upper quartile of the 
normal superpopulation. 
Solution to Exercise 11.1


The Bayesian model here is:

$$f(s \mid y, \mu, \lambda) = \binom{N}{n}^{-1}, \quad s = (1,\ldots,n), (1,\ldots,n-1,n+1),\ldots,(N-n+1,\ldots,N) \quad \text{(SRSWOR)}$$
$$(y_1,\ldots,y_N \mid \mu, \lambda) \sim \text{iid } N(\mu, 1/\lambda)$$
$$f(\mu, \lambda) \propto 1/\lambda, \quad \mu \in \mathbb{R}, \lambda > 0,$$
where $N = 20$, $n = 5$, and where the data is
$$D = (s, y_s) = ((1,\ldots,n),(2.1, 5.2, 3.0, 7.7, 9.3)).$$

Note 1: This data is presented according to a convenient reordering of
population labels, after sampling, so that the sampled values are listed
at the beginning of the finite population vector (as discussed earlier).


Note 2: The superpopulation parameter in the model may be thought
of as the vector

$$\theta = (\theta_1, \theta_2) = (\mu, \lambda),$$
in which case the model could also be written:
$$(s \mid y, \theta) \sim \text{SRSWOR}(N, n)$$
$$(y_1,\ldots,y_N \mid \theta) \sim \text{iid } N(\theta_1, 1/\theta_2)$$
$$f(\theta) \propto 1/\theta_2, \quad \theta_1 \in \mathbb{R}, \theta_2 > 0.$$


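For the purposes of this exercise, let $y_{(i)}$ denote the ith finite population
order statistic, meaning the ith value amongst $y_1,\ldots,y_N$ after these are
ordered from smallest to largest. We are interested in three finite
population quantities, as follows:

(a) $\psi_1 = g_1(y, \theta) = g_1(y) = \dfrac{y_{(10)} + y_{(11)}}{2}$

(b) $\psi_2 = g_2(y, \theta) = g_2(y) = 100 \times \dfrac{\sum_{i=2}^{N}\left( \frac{y_{(i)} - y_{(i-1)}}{y_{(i-1)}} \right) I(y_{(i-1)} > 4)}{\sum_{i=2}^{N} I(y_{(i-1)} > 4)}$

(c) $\psi_3 = g_3(y, \theta) = \sum_{i=1}^{N} y_i\, I\left( y_i > \mu + \frac{\Phi^{-1}(0.75)}{\sqrt{\lambda}} \right)$.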
Note 1: The median $\psi_1$ is the average of the middle two values, since
$N = 20$ is even.


Note 2: In general, $\psi_2$ is defined only if at least two of the finite
population values are greater than 4. For our data, there is no problem
with the definition because the observed sample already contains three
such values. If there were a problem, then $\psi_2 = g_2(y)$ could be
defined as zero (say) in the case where the number of population
values greater than 4 is only 0 or 1, i.e. if $\sum_{i=1}^{N} I(y_i > 4) < 2$.


Note 3: As regards $\psi_3$, if $c$ is the upper quartile of the normal
superpopulation then

$$0.75 = P(y_i < c \mid \theta) = P\left( \frac{y_i - \mu}{\sigma} < \frac{c - \mu}{\sigma} \,\Big|\, \theta \right)$$
$$\Rightarrow \frac{c - \mu}{\sigma} = \Phi^{-1}(0.75)$$
$$\Rightarrow c = \mu + \sigma\Phi^{-1}(0.75) = \mu + \frac{1}{\sqrt{\lambda}}\Phi^{-1}(0.75).$$
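
The threshold in Note 3 can be checked numerically. The sketch below uses arbitrary illustrative values of the mean and precision (they are not estimates from the exercise) and compares the formula with R's built-in normal quantile function.

```r
mu <- 5; lam <- 0.25                           # hypothetical parameter values
sigma <- 1 / sqrt(lam)                         # standard deviation implied by the precision

c_direct  <- qnorm(0.75, mean = mu, sd = sigma)   # upper quartile computed directly
c_formula <- mu + qnorm(0.75) / sqrt(lam)         # mu + Phi^{-1}(0.75)/sqrt(lambda)
c(c_direct, c_formula)                            # both about 6.349
```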


In each case, the inferential target has a posterior/predictive distribution 
which cannot be obtained analytically. One way to proceed is as follows: 


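Step 1. Generate $\lambda_1,\ldots,\lambda_J \sim \text{iid } f(\lambda \mid D) \sim G\left( \frac{n-1}{2}, \frac{n-1}{2}s_s^2 \right)$,

where $s_s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y}_s)^2$.

(This step derives from results for the normal-normal-gamma model.)


Step 2. Generate $\mu_j \sim f(\mu \mid D, \lambda_j) \sim N\left( \bar{y}_s, \frac{1}{n\lambda_j} \right)$ for each $j = 1,\ldots,J$.

(This step derives from results for the normal-normal model.)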
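Step 3. For each $j = 1,\ldots,J$:

• Generate $y_{n+1}^{(j)},\ldots,y_N^{(j)} \sim \text{iid } f(y_i \mid D, \mu_j, \lambda_j) \sim N\left( \mu_j, \frac{1}{\lambda_j} \right)$
• Form $y_r^{(j)} = (y_{n+1}^{(j)},\ldots,y_N^{(j)})$ and
$y^{(j)} = (y_s, y_r^{(j)}) = (y_1,\ldots,y_n, y_{n+1}^{(j)},\ldots,y_N^{(j)})$
• Calculate $\psi^{(j)} = g(y^{(j)}, \theta^{(j)})$, where $\theta^{(j)} = (\mu_j, \lambda_j)$.


Step 4. Use the values $\psi^{(1)},\ldots,\psi^{(J)} \sim \text{iid } f(\psi \mid D)$ for Monte Carlo
inference on $\psi$ in the usual way.


Note 1: Steps 1 and 2 result in the sample
$(\mu_1, \lambda_1),\ldots,(\mu_J, \lambda_J) \sim \text{iid } f(\mu, \lambda \mid D)$.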

Note 2: In the above, Steps 1 and 2 could be replaced as follows:


Step 1’. Generate $\mu_1,\ldots,\mu_J \sim \text{iid } f(\mu \mid D)$. Do this by
first sampling $w_1,\ldots,w_J \sim \text{iid } t(n-1)$ and then forming
$\mu_j = \bar{y}_s + w_j s_s/\sqrt{n}$ for each $j = 1,\ldots,J$
(using results from the normal-normal-gamma model).


Step 2’. Generate $\lambda_j \sim \perp f(\lambda \mid D, \mu_j) \sim G\left( \frac{n}{2}, \frac{n}{2}s_{\mu_j}^2 \right)$, where
$s_{\mu_j}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mu_j)^2$
(using results from the normal-gamma model).


These modified steps will also result in the sample
$(\mu_1, \lambda_1),\ldots,(\mu_J, \lambda_J) \sim \text{iid } f(\mu, \lambda \mid D)$.

Applying the above four-step procedure (using the original Steps 1 and
2) with Monte Carlo sample size $J = 1{,}000$, we obtain Table 11.1 which
shows numerical estimates for the three quantities of interest:
$\psi_1$, $\psi_2$ and $\psi_3$, respectively. Figure 11.1 shows histograms which
illustrate these inferences.
Table 11.1 and Figure 11.1 also contain analogous results for a fourth
quantity of interest which may be defined as

$$\psi_4 = g_4(y, \theta) = (\psi_3 \mid \psi_3 \ne 0)$$
$$= \left( \sum_{i=1}^{N} y_i\, I\left( y_i > \mu + \frac{\Phi^{-1}(0.75)}{\sqrt{\lambda}} \right) \,\Big|\, \sum_{i=1}^{N} y_i\, I\left( y_i > \mu + \frac{\Phi^{-1}(0.75)}{\sqrt{\lambda}} \right) \ne 0 \right).$$

The relevant posterior/predictive density may also be written
$$f(\psi_4 \mid D) = f(\psi_3 \mid D, \psi_3 \ne 0).$$

Inferences on $\psi_4$ were obtained using the 960 values of $\psi_3$ which were
non-zero. It was meaningful to perform this additional inference because
there were 40 simulations amongst the 1,000 for which the upper
quartile of the normal distribution lay above the largest finite population
value, resulting in the sum $\psi_3$ being equal to 0 exactly.


Note 1: From the above, we see that $\psi_3$ is neither a discrete nor a
continuous random variable but one with a mixed distribution.
The discrete part of this mixed distribution is the probability that
$\psi_3 = 0$ exactly, and this we estimated via MC as 40/1,000 = 0.04.


Note 2: We also see that neither $\psi_3$ nor $\psi_4$ is necessarily positive.
This is because it might be the case that the upper quartile of the
normal distribution is negative and many of the finite population
values happen (by a very small chance) to lie between that negative
quartile and zero.
Table 11.1 Point and interval estimates for four quantities

Quantity of interest           $\psi_1$          $\psi_2$           $\psi_3$           $\psi_4 = (\psi_3 \mid \psi_3 \ne 0)$
Posterior mean estimate        5.842             9.975              58.31              60.74
95% CI for posterior mean      (5.790, 5.893)    (9.775, 10.175)    (56.48, 60.15)     (58.99, 62.49)
Posterior mode estimate        5.528             8.150              62.29              62.45
Posterior median estimate      5.769             9.377              59.48              60.59
95% CPDR estimate              (4.308, 7.528)    (5.522, 17.770)    (0.00, 114.87)     (11.72, 114.96)


Figure 11.1 Four histograms and sets of inferences 


(Figure panels: "Monte Carlo inference on psi1", "Monte Carlo inference on psi2", "Monte Carlo inference on psi3" and "Monte Carlo inference on psi4 = (psi3 given psi3 != 0)". Each panel is a histogram of the simulated values with density on the vertical axis, overlaid with a density estimate and with vertical lines marking the posterior mean, its 95% CI and the 95% CPDR (solid) and the posterior mode and median (dashed).)


R Code for Exercise 11.1 
options(digits=4) 


# Define 3 psi functions -----------------
PSI1FUN = function(y){ quantile(y,0.5) }
PSI2FUN = function(y){ ynew=sort(y[y>4]); nnew=length(ynew);
   if(nnew<2) res=NA
   if(nnew>=2) res =100*mean( (ynew[-1]-ynew[-nnew]) / ynew[-nnew] )
   res }
PSI3FUN = function(y,mu,lam){ q = qnorm(0.75); sum(y[y>(mu+q/sqrt(lam))]) }

# Test 3 psi functions -------------------------
PSI1FUN(y=c(1,2,7)) # 2 OK
PSI1FUN(y=c(1,2,7,8)) # 4.5 OK
PSI2FUN(y=c(5,12,6)) # 60 Correct: 100* (1/2) * ( (6-5)/5 + (12-6)/6 ) = 60
PSI2FUN(y=c(5,3,6)) # 20 Correct: 100* (6-5)/5 = 20
PSI2FUN(y=c(5,2.2,3)) # NA Correct
PSI2FUN(y=c(4,4,-3)) # NA Correct
set.seed(311); PSI3FUN(y=rnorm(100,10,1),mu=10,lam=1) # 267 ~ 25*10, OK
# Perform inference on 3 psi functions ----------------------------------------
ys= c(2.1, 5.2, 3.0, 7.7, 9.3); ysbar=mean(ys); n=length(ys); ss2=var(ys); N = 20
options(digits=4); J=1000; set.seed(232)
lamvec=rgamma( J, (n-1)/2, ((n-1)/2) *ss2 )
muvec = rnorm(J,ysbar, 1/sqrt(n*lamvec))
yrmat=matrix(NA, nrow=J, ncol=N-n)
# Row j uses the j-th posterior draw (mu_j, lam_j), as required by Step 3;
# the numerical results quoted in the comments below are those reported in the text
for(j in 1:J) yrmat[j,] = rnorm(N-n,muvec[j],1/sqrt(lamvec[j]))
psi1vec=rep(NA,J); psi2vec=rep(NA,J); psi3vec=rep(NA,J)
for(j in 1:J){ yrj = yrmat[j,]
   psi1vec[j] = PSI1FUN(y=c(ys, yrj))
   psi2vec[j] = PSI2FUN(y= c(ys, yrj))
   psi3vec[j] = PSI3FUN(y= c(ys, yrj), mu=muvec[j], lam=lamvec[j]) }

cbind( summary(psi1vec), summary(psi2vec),
   summary(psi3vec), summary(psi3vec[psi3vec!=0]) )
# Min.     3.14  4.44   0.0   9.3
# 1st Qu.  5.28  7.65  37.9  40.3
# Median   5.77  9.38  59.5  60.6
# Mean     5.84  9.97  58.3  60.7
# 3rd Qu.  6.41 11.50  79.6  80.7
# Max.     9.09 28.10 156.0 156.0

X11(w=9,h=6.5); par(mfrow=c(2,1))
psivec=psi1vec; J = length(psivec)
psibar=mean(psivec); psici=psibar+c(-1,1)*qnorm(0.975)*sd(psivec)/sqrt(J)
fpsi=density(psivec); psimode=fpsi$x[fpsi$y==max(fpsi$y)]
psimedian=quantile(psivec,0.5); psicpdr=quantile(psivec,c(0.025,0.975))
c(psibar, psici, psimode, psimedian, psicpdr)
# 5.842 5.790 5.893 5.528 5.769 4.308 7.528
hist(psivec, prob=T, xlab="psi1",xlim=c(0,10),ylim=c(0,0.6),
   breaks=seq(0,10,0.25), main="Monte Carlo inference on psi1")
lines(fpsi,lwd=3)
abline(v= c(psibar, psici, psicpdr, psimedian, psimode) ,
   lty=c(1,1,1,1,1,2,2), lwd=rep(2,7))
legend(0,0.6,
   c("Posterior mean, 95% CI \n & 95% CPDR","Posterior mode & median"),
   lty=c(1,2), lwd=c(2,2), bg="white")

psivec=psi2vec; J = length(psivec)
psibar=mean(psivec); psici=psibar+c(-1,1)*qnorm(0.975)*sd(psivec)/sqrt(J)
fpsi=density(psivec); psimode=fpsi$x[fpsi$y==max(fpsi$y)]
psimedian=quantile(psivec,0.5); psicpdr=quantile(psivec,c(0.025,0.975))
c(psibar, psici, psimode, psimedian, psicpdr)
# 9.975 9.775 10.175 8.150 9.377 5.522 17.770
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hist(psivec, prob=T, xlab="psi2",xlim=c(2,30),ylim=c(0,0.17), 
breaks=seq(0,30,0.5),main="Monte Carlo inference on psi2") 
lines(fpsi,lwd=3) 
abline(v= c(psibar, psici, psicpdr, psimedian, psimode) , 
Ity=c(1,1,1,1,1,2,2), lwd=rep(2,7)) 
legend(15,0.15, 
c("Posterior mean, 95% Cl & 95% CPDR","Posterior mode & median"), 
Ity=c(1,2), lwd=c(2,2), bg="Wwhite") # End of first 2 graphs 


psivec-psi3vec # Start of next 2 graphs 

psibar=mean(psivec); psici=psibar+c(-1,1)*qnorm(0.975)*sd(psivec)/sqrt(J) 
fpsi=density(psivec); psimode=fpsiSx[fpsisy==max(fpsiSy)] 
psimedian=quantile(psivec,0.5); psicpdr=quantile(psivec,c(0.025,0.975)) 
c(psibar,psici,psimode,psimedian,psicpdr) 

H 58.31 56.48 60.15 62.29 59.48 0.00 114.87 


hist(psivec, prob=T, xlab="psi3",xlim=c(0,160),ylim=c(0,0.022), 
breaks=seq(0,200,5), main="Monte Carlo inference on psi3") 
lines(fpsi,lwd=3) 
abline(v= c(psibar, psici, psicpdr, psimedian, psimode) , 
Ity=c(1,1,1,1,1,2,2), lwd=rep(2,7)) 
legend(100,0.022, 
c("Posterior mean, 95% CI \n& 95% CPDR"),|ty=1,lwd=2,bg="white") 
legend(-5,0.022,c("Posterior mode \n& median"), Ityz2, lwd=2, bg="White") 


length(psi3vec[psi3vec!=0]) # 960 

length(psi3vec[psi3vec==0]) #40 40/1000 = 4% 
psivec=psi3vec[psi3vec!=0]; J=length(psivec); J #960 Condition on psi » O 
psibar=mean(psivec); psici=psibar+c(-1,1)*qnorm(0.975)*sd(psivec)/sqrt(J) 
fpsi=density(psivec); psimode=fpsiSx[fpsisy==max(fpsiSy)] 
psimedian=quantile(psivec,0.5); psicpdr=quantile(psivec,c(0.025,0.975)) 
c(psibar, psici, psimode, psimedian, psicpdr) 

# 60.74 58.99 62.49 62.45 60.59 11.72 114.96 


hist(psivec, prob=T, xlab="psi3, psi4",xlim=c(0,160),ylim=c(0,0.022), 
breaks=seq(0,200,5), 
main="Monte Carlo inference on psi4 = (psi3 given psi3 != 0)") 
lines(fpsi,lwd=3) 
abline(v= c(psibar, psici, psicpdr, psimedian, psimode),
lty=c(1,1,1,1,1,2,2), lwd=rep(2,7))
legend(100,0.022,
c("Posterior mean, 95% CI \n& 95% CPDR"),lty=1,lwd=2,bg="white")
legend(-5,0.022,c("Posterior mode \n& median"), lty=2, lwd=2, bg="white")
11.2 Data transformations


In statistical analysis, a common practice is to first transform the data 
before applying a model. For example, if the data values are strictly 
positive and highly right skewed, it may be worthwhile taking natural 
logarithms before applying a normal model. 


In the classical setting, e.g. in design-based survey sampling, this
idea may work well for purposes of analytical inference (i.e. estimation
of model parameters) but can be problematic for prediction. This is
because the quantity requiring prediction (e.g. the nonsample total) does
not typically have a simple distribution on the untransformed scale.
Although prediction can be performed easily on the transformed scale,
there is generally no simple way to translate the results back onto the
original scale. By contrast, this issue does not create any special
problems within the Bayesian framework.


Suppose that we are interested in some finite population quantity which
is denoted $\psi = g(y)$, e.g. $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$.

Also suppose that there is no convenient superpopulation model for the
finite population values $y_i$, $i = 1, \ldots, N$, but there does exist
such a model for some function of those values, say $z_i = h(y_i)$ for a
function h.

In that case we may consider a Bayesian model specified in terms of:
$f(\xi \mid z, \theta)$ where $\xi$ is s or I or L (as discussed earlier)
$f(z \mid \theta)$ where $z = (z_s, z_r) = ((z_1, \ldots, z_n), (z_{n+1}, \ldots, z_N)) = (z_1, \ldots, z_N)$
$f(\theta)$ where $\theta = (\theta_1, \ldots, \theta_q)$.

We now use Monte Carlo methods (perhaps MCMC methods if needed)
to generate a random sample from the predictive distribution of the
nonsample vector for the z variable (i.e. $z_r$), given the data D (for
example $(s, y_s)$, $(I, y_s)$ or $(L, y_s)$). Let us call this sample
$$z_r^{(1)}, \ldots, z_r^{(J)} \sim \text{iid } f(z_r \mid D).$$

We next calculate $y_i^{(j)} = h^{-1}(z_i^{(j)})$ for each
$i = n+1, \ldots, N$ and each $j = 1, \ldots, J$. Thus, we untransform
the simulated individual data values back to the original scale.
Next, we form the vectors
$$y_r^{(j)} = (y_{n+1}^{(j)}, \ldots, y_N^{(j)})$$
and
$$y^{(j)} = (y_s, y_r^{(j)})$$
for each $j = 1, \ldots, J$.

This results in the samples
$$y_r^{(1)}, \ldots, y_r^{(J)} \sim \text{iid } f(y_r \mid D)$$
and
$$y^{(1)}, \ldots, y^{(J)} \sim \text{iid } f(y \mid D).$$

Finally, we calculate
$$\psi^{(j)} = g(y^{(j)})$$
for each $j = 1, \ldots, J$.

This results in
$$\psi^{(1)}, \ldots, \psi^{(J)} \sim \text{iid } f(\psi \mid D),$$
namely a sample from the predictive distribution of the finite population
quantity of interest, on the original scale required for that quantity. This
sample can then be used for Monte Carlo inference on $\psi$ in the usual
way.


Note: We may think of this topic as an example and special application 
of the last topic, that is, Bayesian inference on complicated functions 
of the finite population vector. 
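
The procedure above can be sketched in R as follows. This is only an illustrative sketch: the function name backtransform_inference, its arguments (draw_zr, h_inv, g) and the toy data are assumptions made here for illustration, not part of the original text. The toy usage takes h = log and a normal model with prior proportional to 1/lambda, in the manner of the R code for Exercise 11.2.

```r
# Sketch: predictive inference on psi = g(y) when the model is for z = h(y).
# draw_zr, h_inv and g are caller-supplied functions (hypothetical names).
backtransform_inference <- function(ys, m, J, draw_zr, h_inv, g) {
  psi <- numeric(J)
  for (j in seq_len(J)) {
    zr <- draw_zr(m)          # one draw of the nonsample z-vector from f(z_r | D)
    yr <- h_inv(zr)           # untransform to the original scale
    psi[j] <- g(c(ys, yr))    # finite population quantity for this draw
  }
  list(mean  = mean(psi),
       ci    = mean(psi) + c(-1, 1) * qnorm(0.975) * sd(psi) / sqrt(J),
       cpdr  = quantile(psi, c(0.025, 0.975), names = FALSE),
       draws = psi)
}

# Toy usage: log transformation, psi = finite population mean.
set.seed(1)
ys <- c(2.1, 0.7, 5.3, 1.9, 12.4, 3.3, 0.9, 4.8)   # made-up sample values
zs <- log(ys); n <- length(zs); m <- 12             # m = number of nonsample units
draw_zr <- function(m) {
  lam <- rgamma(1, (n - 1) / 2, var(zs) * (n - 1) / 2)  # precision | D
  mu  <- rnorm(1, mean(zs), 1 / sqrt(n * lam))          # mean | precision, D
  rnorm(m, mu, 1 / sqrt(lam))                           # nonsample z values
}
out <- backtransform_inference(ys, m, J = 500, draw_zr, exp, mean)
```

Passing h_inv and g as functions keeps the untransform step and the choice of finite population quantity separate, so the same loop serves for other transformations or other functions g.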


Exercise 11.2 Finite population inference using data 
transformation 


Consider the following random sample of size 50 from a finite 
population of size 200: 


28.374, 69.857, 22.721, 57.593, 126.965, 17.816, 16.078, 0.803, 3.164, 3.544, 
2.123, 2.353, 184.539, 59.856, 63.701, 585.684, 29.094, 79.245, 18.105, 1.623, 


5.513, 1.629, 63.654, 22.060, 187.463, 5.051, 34.299, 27.475, 0.746, 34.016, 
8.547, 1.081, 3.151, 55.569, 2.593, 522.377, 1.660, 130.435, 1.246, 169.462, 
3.444, 6.376, 18.735, 51.312, 33.920, 350.346, 475.795, 4.972, 24.451, 86.987. 


Use Bayesian methods with a suitable transformation to estimate the 
finite population mean. 
Solution to Exercise 11.2


We create a histogram of the sample values and see that the underlying 
distribution is highly right skewed. However, a histogram of the natural 
logarithm of the sample values is consistent with a normal 
superpopulation model. The histograms are shown in Figure 11.2. 


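Therefore we posit the following Bayesian model involving an
uninformative prior and the logarithms of the finite population values,
$z_i = h(y_i) = \log y_i$, $i = 1, \ldots, N$ (N = 200):

$(s \mid z, \mu, \lambda) \sim \text{SRSWOR}$

$(z_1, \ldots, z_N \mid \mu, \lambda) \sim \text{iid } N(\mu, 1/\lambda)$

$f(\mu, \lambda) \propto 1/\lambda, \quad \mu \in \mathbb{R}, \ \lambda > 0.$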


Figure 11.2 Histograms of the sample data 
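The data is $D = (s, y_s) = ((1, \ldots, 50), (28.374, 69.857, \ldots, 86.987))$
(after a convenient ordering), and the quantity of interest is
$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = g(z) = \frac{1}{N}\sum_{i=1}^{N} h^{-1}(z_i) = \frac{1}{N}\sum_{i=1}^{N} \exp(z_i).$$

So we generate
$$(\mu_1, \lambda_1), \ldots, (\mu_J, \lambda_J) \sim \text{iid } f(\mu, \lambda \mid D)$$
(using methods detailed previously).

Then for each $j = 1, \ldots, J$ we sample
$$z_{n+1}^{(j)}, \ldots, z_N^{(j)} \sim \text{iid } N(\mu_j, 1/\lambda_j)$$
and calculate
$$\bar{y}^{(j)} = \frac{1}{N}\left( y_1 + \ldots + y_n + \exp(z_{n+1}^{(j)}) + \ldots + \exp(z_N^{(j)}) \right) = \frac{1}{N}\left( n\bar{y}_s + \sum_{i=n+1}^{N} \exp(z_i^{(j)}) \right).$$

The result is
$$\bar{y}^{(1)}, \ldots, \bar{y}^{(J)} \sim \text{iid } f(\bar{y} \mid D),$$
which can then be used for Monte Carlo inference.

Applying the above procedure with a Monte Carlo sample size of
J = 1,000 we estimate $\bar{y}$'s posterior mean, $\hat{\bar{y}} = E(\bar{y} \mid D)$,
and so also $\bar{y}$ itself, by
$$\bar{\bar{y}} = \frac{1}{J}\sum_{j=1}^{J} \bar{y}^{(j)} = 110.83,$$
with 95% CI for $\hat{\bar{y}}$
$$\left( \bar{\bar{y}} \pm 1.96 \sqrt{\frac{1}{J(J-1)}\sum_{j=1}^{J} (\bar{y}^{(j)} - \bar{\bar{y}})^2} \right) = (104.64, 117.02).$$

We also estimate the bounds of the 95% CPDR for $\bar{y}$ by 49.26 and
302.05, where these are the empirical 0.025 and 0.975 quantiles of
$\bar{y}^{(1)}, \ldots, \bar{y}^{(J)}$.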

Figure 11.3 shows a histogram of the simulated values of $\bar{y}$, together
with the above five numbers, as well as a 'histogram estimate' of the
predictive density $f(\bar{y} \mid D)$. In this histogram the dot shows the true
value of the finite population mean, $\bar{y} = 114.2$, which was known prior
to the generation of the sample data.

Figure 11.3 Inference on the finite population mean

Figure 11.4 shows histograms of the values $z_1, \ldots, z_N$, which were in fact
drawn from the normal distribution with mean 3 and standard deviation
2 (left plot), and the values $y_1 = \exp(z_1), \ldots, y_N = \exp(z_N)$ (right plot),
together with the true underlying superpopulation densities of the
variables $z_i$ and $y_i$.


Figure 11.4 Histogram of all N values of z and y 
For comparison we repeat the above inference on the original scale of
the data and 'exactly' (since there is then no need for Monte Carlo
methods).

In that case (where we replace z by y in the Bayesian model) we find
that the predictive mean of $\bar{y}$ is $\hat{\bar{y}} = E(\bar{y} \mid D) = \bar{y}_s = 74.15$
(the average of the raw data values), and the 95% CPDR for $\bar{y}$ is
exactly (41.36, 106.94). We see that this inference does much worse at
estimating $\bar{y}$, whose true value is 114.2.

Note: This second set of inference is the same as design-based
inference since it is based on the result
$$\left( \frac{\bar{y} - \bar{y}_s}{(s_s/\sqrt{n})\sqrt{1 - n/N}} \,\Big|\, D \right) \sim t(n-1), \quad \text{where } s_s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y}_s)^2.$$


Figure 11.5 shows the original data values (untransformed) and both sets 
of inferences above. It highlights the value of performing an appropriate 
prior transformation for purposes of estimating the finite population 
mean. 


Figure 11.5 Comparison of two sets of inference

For interest, we repeat the above simulations and comparison with a
N(2,1) model for the $z_i$ (rather than a N(3,4) model). Figure 11.6 shows
the analogue of the last figure above.

We see, of course, that the benefit of applying the log transformation to
the data diminishes as the skewness of the sample data decreases.

Figure 11.6 Comparison of two sets of inference with less
skewed data

[Figure 11.6: density histogram of the sample values; dashed lines show the
inference using the original scale and solid lines the inference using the
log transformation; the dot shows 11.7, the true value of the finite
population mean.]


Note 1: Using the formula for sample skewness given by
$$g = \frac{(1/n)\sum_{i=1}^{n} (y_i - \bar{y}_s)^3}{\left\{ (1/n)\sum_{i=1}^{n} (y_i - \bar{y}_s)^2 \right\}^{3/2}},$$
we obtained a value of g = 2.662 for the first set of data and a value of
g = 1.549 for the second set of data.
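
As a small illustration of the skewness formula in Note 1, the following sketch wraps it in a function. The name skew and the toy inputs are assumptions made here; the body mirrors the one-line computation in the R code for this exercise.

```r
# Sample skewness g = m3 / m2^(3/2), where m_k is the k-th central
# sample moment computed with divisor n (as in Note 1).
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3/2)
}

g_sym   <- skew(c(1, 2, 3))      # symmetric toy data, so g = 0
g_right <- skew(c(0, 0, 0, 4))   # right-skewed toy data, g = 2/sqrt(3)
```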


Note 2: For another example of finite population inference via 
Bayesian and MCMC methods which involves the logarithmic 
transformation, see Puza (2002). This other example also features the 
use of covariate information. 


Note 3: It can be shown (mathematically) that $\hat{\bar{y}} = E(\bar{y} \mid D) = \infty$
(exactly). This seems somewhat counterintuitive in light of the fact
that our Monte Carlo estimate $\bar{\bar{y}} = 110.83$ is very close to the actual
finite population mean, $\bar{y} = 114.2$.
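
A brief sketch of why the posterior mean in Note 3 is infinite, under the model of this exercise, is given below. It relies on the standard result that, for the normal model with prior proportional to $1/\lambda$, the predictive distribution of a nonsampled $z_i$ is a scaled and shifted Student's t distribution; the symbol $s_z$ (the sample standard deviation of the log values) follows the naming in the R code rather than the text.

```latex
% Predictive distribution of a nonsampled z_i (i = n+1, ..., N):
\left( \frac{z_i - \bar{z}_s}{s_z \sqrt{1 + 1/n}} \,\Big|\, D \right) \sim t(n-1).
% The t(n-1) density has polynomial tails, f(z_i \mid D) \propto |z_i|^{-n}
% as z_i \to \infty, which cannot offset exponential growth, so
E\{ \exp(z_i) \mid D \} = \int_{-\infty}^{\infty} e^{z} f(z \mid D)\, dz = \infty ,
% and hence
E(\bar{y} \mid D) = \frac{1}{N} \Big( n \bar{y}_s + \sum_{i=n+1}^{N} E\{ \exp(z_i) \mid D \} \Big) = \infty .
```

A Monte Carlo average of finitely many draws is always finite, so it cannot reveal this directly. The occasional very large simulated values (the maximum of about 2080 in the summary of ybarvec in the R code) are a symptom of the heavy tail, and suggest that the Monte Carlo mean and its CI should be read with some caution; the quantile-based CPDR is not affected in the same way.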
R Code for Exercise 11.2
# Data generation used to set up exercise -------------------------------------------- 


options(digits=4); X11(w=8,h=6); par(mfrow=c(2,2))

N=200; n=50; set.seed(432); Z=rnorm(N,3,2); S=sample(1:N,n)

ZS=Z[S]; Y=exp(Z); YS=exp(ZS); YBAR=mean(Y); YBAR # 114.2

hist(Z,prob=T); hist(Y,prob=T); hist(ZS,prob=T); hist(YS,prob=T) 
# preliminary plots 


X11(w=8,h=4); par(mfrow=c(1,2)) 

hist(Z,prob=T,xlim=c(-4,10), ylim=c(0,0.25),breaks=seq(-3,12,0.5)) 
lines(seq(-5,12,0.01),dnorm(seq(-5,12,0.01),3,2),lwd=3) 

hist(Y,prob=T,xlim=c(0,600),ylim=c(0,0.08), breaks=seq(0,5000,10)); 
yg=seq(0.1,700,0.5); lines(yg ,dnorm( log(yg),3,2)/yg, lwd=3) 


format(list(YS=YS),digits=3) # "28.374, 69.857, 22.721, ..., 24.451, 86.987" 


# Look at given data and the log of that data (load data etc.) ------------------ 

N = 200; n = 50; m = N-n; options(digits=4) 

ys=c( 28.374, 69.857, 22.721, 57.593, 126.965, 
17.816, 16.078, 0.803, 3.164, 3.544, 
2.123, 2.353, 184.539, 59.856, 63.701, 
585.684, 29.094, 79.245, 18.105, 1.623, 
5.513, 1.629, 63.654, 22.060, 187.463, 
5.051, 34.299, 27.475, 0.746, 34.016, 
8.547, 1.081, 3.151, 55.569, 2.593, 
522.377, 1.660, 130.435, 1.246, 169.462, 
3.444, 6.376, 18.735, 51.312, 33.920, 
350.346, 475.795, 4.972, 24.451, 86.987) 


summary(ys) 
# Min.1st Qu. Median Mean 3rd Qu. Max. 
# 0.7 3.5 23.6 74.2 63.7 586.0 


skewness=mean( (ys-mean(ys))^3 )/ ( mean((ys-mean(ys))^2) )^(3/2)
skewness # 2.662 


zs=log(ys); par(mfrow=c(1,2)) 

hist(ys, prob=T); hist(zs, prob=T) # preliminary plots 

hist(ys, prob=T,xlim=c(0,600), ylim=c(0,0.045), 
breaks=seq(0,700,10), main="Sample values"); 

hist(zs, prob=T,xlim=c(-2,8), ylim=c(0,0.35), 
breaks=seq(-3,10,0.5), main="Log of sample values"); 
# Finite population inference using original scale and design-based approach
# (same as the 'exact' Bayesian approach without Monte Carlo) -----------------
ysbar=mean(ys); sy=sd(ys); ybarhat=ysbar
ybarci=ybarhat+c(-1,1)*qt(0.975,n-1)* (sy/sqrt(n)) * sqrt(1-n/N) 
inf.original=c(ybarhat,ybarci); 

c(inf.original, YBAR) # 74.15 41.36 106.94 114.24 


# Finite population inference via Bayesian approach using log transformation 
# (and a 'crude' approach which makes no use of Rao-Blackwell ideas etc.) ---- 


zsbar=mean(zs); sz=sd(zs); J=1000; set.seed(142); 
lamvec=rgamma(J,(n-1)/2,(sz^2)*(n-1)/2)
muvec=rnorm(J,zsbar,1/sqrt(n*lamvec)); yrbarvec=rep(NA,J) 


for(j in 1:J){ zr=rnorm(m, muvec[j], 1/sqrt(lamvec[j]) )
yr=exp(zr); yrbarvec[j]=mean(yr) }

ybarvec=(1/N)*(n*ysbar+m*yrbarvec); ybarhat=mean(ybarvec) 

ybarci=ybarhat+c(-1,1)*qnorm(0.975)*sd(ybarvec)/sqrt(J) 

ybarcpdr=quantile(ybarvec,c(0.025,0.975)) 

inf.transform = c(ybarhat,ybarci,ybarcpdr) 

c(inf.transform,YBAR) # 110.83 104.64 117.02 49.26 302.05 114.24 


summary(ybarvec) 
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 37.0 70.6 89.4 111.0 122.0 2080.0 


par(mfrow=c(1,1)); hist(ybarvec,prob=T) # preliminary plot
hist(ybarvec,prob=T,xlim=c(0,600),ylim=c(0,0.015),
breaks=seq(0,3000,10), main=" ");

abline(v=inf.transform,lty=1,lwd=2); points(YBAR,0,pch=16)
legend(310,0.015,c("Inference using log transformation"),lty=c(1),lwd=c(2))
text(450,0.01, 

"The dot shows 114.2, the true value \nof the finite population mean") 
lines(density(ybarvec),lwd=2) 


par(mfrow=c(1,1)); hist(ys,prob=T) # preliminary plot 
hist(ys, prob=T,xlim=c(0,600), ylim=c(0,0.045), breaks=seq(0,700,10), main=" "); 
abline(v=inf.original,lty=2,lwd=2); abline(v=inf.transform,lty=1,lwd=2)
points(YBAR,0,pch=16)
legend(310,0.04,c("Inference using original scale",
"Inference using log transformation"), lty=c(2,1),lwd=c(2,2))

text(450,0.02, 

"The dot shows 114.2, the true value \nof the finite population mean") 
# Repeat with 'less extreme' lognormal data -----------------------------------------

N=200; n=50; set.seed(432); Z=rnorm(N,2,1); S=sample(1:N,n) # <- difference
ZS=Z[S]; Y=exp(Z); YS=exp(ZS); YBAR=mean(Y)
X11(w=8,h=6); par(mfrow=c(2,2));
hist(Z,prob=T); hist(Y,prob=T); hist(ZS,prob=T); hist(YS,prob=T) 

# preliminary plots 
ys = YS; zs=log(ys); 
skewness=mean( (ys-mean(ys))^3 )/ ( mean((ys-mean(ys))^2) )^(3/2) 
skewness # 1.549 
ysbar=mean(ys); sy=sd(ys); ybarhat=ysbar
ybarci=ybarhat+c(-1,1)*qt(0.975,n-1)* (sy/sqrt(n)) * sqrt(1-n/N) 
inf.original =c(ybarhat,ybarci); 
c(inf.original, YBAR) # 10.541 8.177 12.906 11.698 


zsbar=mean(zs); sz=sd(zs); J=1000;  set.seed(142); 
lamvec=rgamma(J,(n-1)/2,(sz^2)*(n-1)/2)
muvec=rnorm(J,zsbar,1/sqrt(n*lamvec)); yrbarvec=rep(NA,J) 


for(j in 1:J){ zr=rnorm(m, muvec[j], 1/sqrt(lamvec[j]) )
yr=exp(zr); yrbarvec[j]=mean(yr) }


ybarvec=(1/N)*(n*ysbar+m*yrbarvec); ybarhat=mean(ybarvec) 
ybarci=ybarhat+c(-1,1)*qnorm(0.975)*sd(ybarvec)/sqrt(J) 
ybarcpdr=quantile(ybarvec,c(0.025,0.975)) 

inf.transform = c(ybarhat,ybarci,ybarcpdr) 

c(inf.transform,YBAR) # 11.006 10.904 11.108 8.478 15.016 11.698 


X11(w=8,h=4); par(mfrow=c(1,1))
hist(ys, prob=T) # preliminary plot
hist(ys, prob=T,xlim=c(0,40),ylim=c(0,0.2), breaks=seq(0,40,1), main=" ");
abline(v=inf.original,lty=2,lwd=2); abline(v=inf.transform,lty=1,lwd=2)
points(YBAR,0,pch=16)
legend(20,0.2,
c("Inference using original scale", "Inference using log transformation"),
lty=c(2,1),lwd=c(2,2))
text(30,0.1, 
"The dot shows 11.7, the true value \nof the finite population mean") 
11.3 Frequentist properties of Bayesian finite
population estimators


We have previously studied the frequentist characteristics of Bayesian 
estimators. That was in the context of analytic inference (i.e. inference 
on model parameters) and based on a random sample from a 
hypothetically infinite population (e.g. a normal distribution). We will 
now generalise those ideas in the broader framework of a Bayesian finite 
population model. 


As before, we are primarily interested in the frequentist characteristics of
Bayesian estimators which are based on uninformative priors and used
as proxies for classical or design-based estimators. Nevertheless, we will
consider both types of prior (informative and uninformative).


Consider a Bayesian finite population model specified in terms of:
$f(\xi \mid y, \theta)$ where $\xi$ is s or I or L (as discussed earlier)
$f(y \mid \theta)$ where $y = (y_s, y_r) = ((y_1, \ldots, y_n), (y_{n+1}, \ldots, y_N)) = (y_1, \ldots, y_N)$
$f(\theta)$ where $\theta = (\theta_1, \ldots, \theta_q)$.

Also suppose that the data is
$D = (s, y_s)$ or $(I, y_s)$ or $(L, y_s)$
(as the case may be), and the quantity of interest is
$$\psi = g(\theta, y)$$
(generally) or $\psi = g(\theta)$ (as considered previously for 'pure' analytic
inference) or $\psi = g(y)$ (the case of 'pure' finite population inference).

Now suppose that in the context of this general model, data and quantity
of interest, we derive a point estimate for $\psi$ (such as the posterior mean,
mode or median) of the form
$$\hat{\psi} = \hat{\psi}(D)$$
and a $1 - \alpha$ interval estimate for $\psi$ (such as the CPDR or HPDR) of the
form
$$I = (L, U) = I(D) = (L(D), U(D)).$$

Note: If the sampling mechanism is defined in terms of $I = (I_1, \ldots, I_N)$,
the vector of inclusion counters, there is a conflict of notation and one
of these quantities needs a different symbol.
In the above context, there may be interest in the frequentist bias of $\hat{\psi}$
and the frequentist coverage probabilities of the interval I, especially if
these estimators are intended as proxies for classical ones.

However, because there is now an extra level in the Bayesian model
hierarchy relative to previously, in the form of the density defining the
sampling mechanism, namely
$$f(\xi \mid y, \theta),$$
there are two ways (at least) of defining the required frequentist
characteristics:

• model-based, meaning conditional on $\theta$ and $\xi$
• design-based, meaning conditional on $\theta$ and y.

For definiteness, suppose that the data is $D = (s, y_s)$. Then we define:

• the model bias of $\hat{\psi}$ as
$$B_{\theta,s} = E\{ \hat{\psi}(s, y_s) - \psi(y, \theta) \mid \theta, s \}$$

• the relative model bias of $\hat{\psi}$ as
$$R_{\theta,s} = E\left\{ \frac{\hat{\psi}(s, y_s) - \psi(y, \theta)}{\psi(y, \theta)} \,\Big|\, \theta, s \right\}$$

• the model coverage probability of I as
$$C_{\theta,s} = P\{ \psi(y, \theta) \in I(s, y_s) \mid \theta, s \}.$$

Also, we define:

• the design bias of $\hat{\psi}$ as
$$B_{\theta,y} = E\{ \hat{\psi}(s, y_s) - \psi(y, \theta) \mid \theta, y \}$$

• the relative design bias of $\hat{\psi}$ as
$$R_{\theta,y} = E\left\{ \frac{\hat{\psi}(s, y_s) - \psi(y, \theta)}{\psi(y, \theta)} \,\Big|\, \theta, y \right\}$$

• the design coverage probability of I as
$$C_{\theta,y} = P\{ \psi(y, \theta) \in I(s, y_s) \mid \theta, y \}.$$
Note 1: Each of the three model-based characteristics is an expectation
with respect to the distribution of y given $\theta$ and s. Each of the three
design-based characteristics is an expectation with respect to the
distribution of s given $\theta$ and y.

Note 2: Analogous definitions apply if $D = (I, y_s)$ or $D = (L, y_s)$, etc.,
noting that s is a function of I and L, there is a one-to-one
correspondence between I and s under sampling without replacement,
etc. For instance, if $D = (I, y_s)$, we define the model bias of $\hat{\psi}$ as
$$B_{\theta,I} = E\{ \hat{\psi}(I, y_s) - \psi(y, \theta) \mid \theta, I \},$$
and when $D = (L, y_s)$, we define the model bias of $\hat{\psi}$ as
$$B_{\theta,L} = E\{ \hat{\psi}(L, y_s) - \psi(y, \theta) \mid \theta, L \}, \text{ etc.}$$

Note 3: If a model-based characteristic such as the model bias $B_{\theta,s}$ is
the same for all possible samples s, then s may be dropped from the
subscript; e.g. we may instead write $B_{\theta}$. Likewise, if a design-based
characteristic such as the design bias $B_{\theta,y}$ is the same for all possible
values of the model parameter $\theta$, then $\theta$ may be dropped; e.g. we
may write $B_y$.

Note 4: If a model-based or design-based characteristic cannot be
evaluated analytically then it may be possible to estimate it via a Monte
Carlo simulation. This idea features in the next exercise below.
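
As a rough illustration of Note 4, the sketch below estimates a model bias by Monte Carlo. The setting (N = 100, n = 20, mu = 10, sigma = 2, sample equal to the first n units, and the estimator ybar_s/s_s of gamma = mu/sigma) is borrowed from the exercise that follows; the seed, the variable names and the choice K = 5000 are assumptions made here. In this particular case the bias can also be obtained analytically, using the independence of the sample mean and sample variance under normality, which gives a check on the simulation.

```r
# Sketch: Monte Carlo estimate of the model bias of gamma_hat = ybar_s / s_s
# as an estimator of gamma = mu / sigma, conditional on s = (1, ..., n).
set.seed(2024)                      # arbitrary seed, for reproducibility only
N <- 100; n <- 20; mu <- 10; sigma <- 2; K <- 5000
gam <- mu / sigma
est <- numeric(K)
for (k in seq_len(K)) {
  y  <- rnorm(N, mu, sigma)         # new finite population from the superpopulation
  ys <- y[1:n]                      # the sample is always the first n values
  est[k] <- mean(ys) / sd(ys)
}
bias_hat <- mean(est) - gam
bias_ci  <- bias_hat + c(-1, 1) * qnorm(0.975) * sd(est) / sqrt(K)

# Analytic value for comparison:
# E(1/s_s) = (1/sigma) * sqrt((n-1)/2) * gamma((n-2)/2) / gamma((n-1)/2).
bias_exact <- gam * (sqrt((n - 1) / 2) * gamma((n - 2) / 2) / gamma((n - 1) / 2) - 1)
c(bias_hat, bias_ci, bias_exact)
```

Here the analytic model bias is about 0.21, i.e. roughly 4% of gamma = 5. When no such closed form is available, only the simulated estimate and its CI can be reported.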


Note 5: The model bias of $\hat{\psi}$ above is a generalisation of the
frequentist bias of an estimator as defined earlier and based on a
random sample from an infinite population (e.g. a normal distribution).
The following argument illustrates. Suppose that $\psi = \theta$, $\hat{\psi} = \bar{y}_s$ (the
sample mean) and the sampling mechanism is SRSWOR. Then, by the
above definitions, the model bias of $\hat{\psi}$ is
$$B_{\theta,s} = E\{ \hat{\psi}(s, y_s) - \psi(y, \theta) \mid \theta, s \} \quad \text{(generally)}$$
$$= E(\bar{y}_s - \theta \mid \theta, s) = E(\bar{y}_s \mid \theta, s) - \theta,$$
where
$$E(\bar{y}_s \mid \theta, s) = \int \bar{y}_s f(y \mid \theta, s)\, dy.$$

Now, in this case,
$$f(s \mid y, \theta) = f(s) = \binom{N}{n}^{-1} \quad \text{for all } s = (1, \ldots, n), \ldots, (N-n+1, \ldots, N),$$
so that
$$f(y \mid \theta, s) \propto f(y, \theta, s) = f(s \mid y, \theta) f(y \mid \theta) f(\theta) \propto 1 \times f(y \mid \theta) \times 1,$$
and therefore
$$f(y \mid \theta, s) = f(y \mid \theta) = f(y_s, y_r \mid \theta) = f(y_r \mid \theta, y_s) f(y_s \mid \theta),$$
with s fixed at its observed value.

From these observations we see that
$$E(\bar{y}_s \mid \theta, s) = \iint \bar{y}_s f(y_r \mid \theta, y_s) f(y_s \mid \theta)\, dy_r\, dy_s$$
$$= \int \bar{y}_s f(y_s \mid \theta) \left\{ \int f(y_r \mid \theta, y_s)\, dy_r \right\} dy_s = \int \bar{y}_s f(y_s \mid \theta) \times 1\, dy_s = E(\bar{y}_s \mid \theta).$$

Therefore $B_{\theta,s} = E(\bar{y}_s \mid \theta) - \theta = E(\bar{y}_s - \theta \mid \theta)$.

We have shown that the model bias here is the same as the bias of $\bar{y}_s$
in the earlier non-finite population context (where s did not feature in
the notation).

This is an example of where s could be dropped from the subscript in
$B_{\theta,s}$, i.e. where this could also be written $B_{\theta}$.

If the sampling mechanism in this illustration were nonignorable, with
$f(s \mid y, \theta)$ depending on y in some way, then the simplifications above
might not be possible and the bias might need to be evaluated, with
more difficulty, according to the formula
$$B_{\theta,s} = -\theta + \int \bar{y}_s f(y \mid \theta, s)\, dy = -\theta + \int \bar{y}_s \frac{f(y, \theta, s)}{f(\theta, s)}\, dy,$$
where: $f(\theta, s) = \int f(y, \theta, s)\, dy$,
$f(y, \theta, s) = f(s \mid y, \theta) f(y \mid \theta) f(\theta)$, etc.
Note 6: The design bias of $\hat{\psi}$ above is a generalisation of the bias of
an estimator in the classical survey sampling context where a sample is
drawn from a finite population of values which are thought of as
constants. The following argument illustrates. Suppose that $\psi = \bar{y}$ (the
finite population mean), $\hat{\psi} = \bar{y}_s$ (the sample mean) and the sampling
mechanism is SRSWOR. Then, by the above definitions, the design
bias of $\hat{\psi}$ is
$$B_{\theta,y} = E\{ \hat{\psi}(s, y_s) - \psi(y, \theta) \mid \theta, y \} \quad \text{(generally)}$$
$$= E(\bar{y}_s - \bar{y} \mid \theta, y) = E(\bar{y}_s \mid \theta, y) - \bar{y}.$$

Now,
$$E(\bar{y}_s \mid \theta, y) = \sum_s \bar{y}_s f(s \mid y, \theta) = \frac{1}{k} \sum_s \bar{y}_s, \quad \text{where } f(s \mid y, \theta) = \frac{1}{k} \text{ and } k = \binom{N}{n},$$
$$= \frac{1}{kn} \left\{ (y_1 + \ldots + y_n) + \ldots + (y_{N-n+1} + \ldots + y_N) \right\}.$$

Here, the expression { } contains a total of kn terms, with each of
$y_1, \ldots, y_N$ represented equally often and therefore kn/N times.

We see that $\{\ \} = \frac{kn}{N}(y_1 + \ldots + y_N) = kn\bar{y}$.

Thus $E(\bar{y}_s \mid \theta, y) = \frac{1}{kn} kn\bar{y} = \bar{y}$,
and so $B_{\theta,y} = E(\bar{y}_s \mid \theta, y) - \bar{y} = \bar{y} - \bar{y} = 0$.

We have here simply followed through with our general definitions
and notation to show that under SRSWOR the sample mean is
unbiased for the population mean.
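
The argument in Note 6 can be checked numerically on a small example by enumerating every possible sample. The sketch below does this in R for a made-up population of N = 6 values and n = 3; the values and variable names are assumptions chosen only for illustration.

```r
# Sketch: exact design expectation of the sample mean under SRSWOR,
# obtained by enumerating all k = choose(N, n) equally likely samples.
y <- c(3, 7, 8, 12, 20, 4)            # made-up finite population values
N <- length(y); n <- 3
samples <- combn(N, n)                # each column holds the labels of one sample s
ybar_s  <- apply(samples, 2, function(s) mean(y[s]))
k <- ncol(samples)                    # number of possible samples
design_mean <- sum(ybar_s) / k        # E(ybar_s | theta, y) with f(s | y, theta) = 1/k
design_bias <- design_mean - mean(y)  # zero up to floating-point rounding
```

Enumeration is only feasible for very small N and n; for realistic sizes the design expectation would instead be approximated by repeated sampling.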


If the sampling mechanism were nonignorable, with $f(s \mid y, \theta)$
depending on y in some way, then the bias of the sample mean might
need to be evaluated, with more difficulty, according to the formula
$$B_{\theta,y} = -\bar{y} + \sum_s \bar{y}_s f(s \mid y, \theta) = -\bar{y} + \sum_s \bar{y}_s \frac{f(s, y, \theta)}{f(y, \theta)},$$
where $f(y, \theta) = \sum_s f(s, y, \theta)$, $f(s, y, \theta) = f(s \mid y, \theta) f(y \mid \theta) f(\theta)$, etc.
Exercise 11.3 Frequentist properties of Bayesian estimators in
a normal finite population model

Consider a sample of size n = 20 taken from a finite population of size
N = 100 according to SRSWOR, where the population values are normal
with mean $\mu = 10$ and variance $\sigma^2 = 1/\lambda = 4$, with prior given by
$$f(\mu, \lambda) \propto 1/\lambda, \quad \mu \in \mathbb{R}, \ \lambda > 0 \quad \text{(uninformative)}.$$

(a) Using these specifications, generate a finite population vector
$y = (y_1, \ldots, y_N)$, take the sample vector as $y_s = (y_1, \ldots, y_n)$, and then use
Monte Carlo (MC) methods with a sample size of J = 1,000 to estimate
the superpopulation signal to noise ratio defined by $\gamma = \mu/\sigma$.

Report a point estimate of $\gamma$ in the form of a MC estimate of the
posterior mean $\hat{\gamma} = E(\gamma \mid D)$ where $D = (s, y_s)$ is the data, and an
interval estimate in the form of a MC estimate of the 95% CPDR for $\gamma$.
(Do not bother to calculate a 95% CI for $\hat{\gamma}$.)

What is the difference between your point estimate and $\gamma$? Does $\gamma$ lie
inside the interval? Calculate $\tilde{\gamma}$, the MLE of $\gamma$, and report the difference
between $\tilde{\gamma}$ and $\gamma$.

Illustrate your inferences by drawing a suitable histogram of the
simulated values of $\gamma$, marked over with the various estimates.

(b) Perform the procedure in (a) K = 100 times independently, with K
different finite populations but the sample always consisting of the first
n values in that finite population.

Based on your results, estimate the model bias and relative model bias of
your point estimator, and the model coverage of your interval estimator.
Also estimate the model bias and relative model bias of the MLE $\tilde{\gamma}$.

Illustrate your results by drawing a suitable histogram of the K simulated
MC estimates, marked over with the various relevant quantities.

(c) Repeat (b) but with K = 5,000 and discuss.
(d) Generate a finite population vector $y = (y_1, \ldots, y_N)$, and then take a
sample from the finite population via SRSWOR. Then use MC methods
with sample size J = 1,000 to estimate the finite population ratio of
largest value to median, which is given by the formula
$$\psi = \frac{y_{(100)}}{(y_{(50)} + y_{(51)})/2},$$
where $y_{(i)}$ is the ith order statistic for the N population values $y_1, \ldots, y_N$.

Report a point estimate of $\psi$ in the form of a MC estimate of the
posterior mean $\hat{\psi} = E(\psi \mid D)$ and an interval estimate in the form of a
MC estimate of the 95% CPDR for $\psi$. (Do not bother to calculate a
95% CI for $\hat{\psi}$.)

What is the difference between your point estimate and $\psi$? Does $\psi$ lie
inside the interval?

Illustrate your inferences by drawing a suitable histogram of the
simulated values of $\psi$, marked over with the various estimates.

(e) Perform the procedure in (d) K = 100 times independently, with K
different samples taken from the same finite population.

Based on your results, estimate the design bias and relative design bias
of your point estimator, and the design coverage of your interval
estimator.

Illustrate your results by drawing a suitable histogram of the K simulated
MC estimates, marked over with the various relevant quantities.

(f) Repeat (e) using two other point estimators, respectively.


Solution to Exercise 11.3 
(a) A finite population of size N = 100 from the $N(\mu = 10, \sigma^2 = 4)$
distribution was generated. The mean and standard deviation of
the 100 finite population values were $\bar{y} = 9.932$ and $s_y = 1.907$. Figure
11.7 shows a histogram of these values.

Figure 11.7 Histogram of N = 100 finite population values

Then the first n = 20 values were taken as a sample from the finite
population. Figure 11.8 shows a histogram of these sample values. The
sample mean and standard deviation of the sample values were
$\bar{y}_s = 10.516$ and $s_s = 1.749$. So the MLE of $\gamma = \mu/\sigma$ was calculated
as $\tilde{\gamma} = \bar{y}_s / s_s = 6.011$.

Figure 11.8 Histogram of n = 20 sample values

Then a Monte Carlo sample of size J = 1,000 was taken from the joint
posterior distribution of $\mu$ and $\lambda = 1/\sigma^2$, i.e. from $f(\mu, \lambda \mid D)$ where
$D = (s, y_s)$. Hence a MC sample of size J was obtained from the
posterior distribution of $\gamma$, namely $\gamma_1, \ldots, \gamma_J \sim \text{iid } f(\gamma \mid D)$.
Note: As explained in previous exercises, this was done by:

• first sampling $\lambda_1, \ldots, \lambda_J \sim \text{iid } G\left( \frac{n-1}{2}, \frac{n-1}{2} s_s^2 \right)$
• then sampling $w_1, \ldots, w_J \sim \text{iid } t(n-1)$
• next forming $\mu_j = \bar{y}_s + w_j s_s / \sqrt{n}$
• finally calculating $\gamma_j = \mu_j \sqrt{\lambda_j}$.
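
One way to carry out such a simulation in R is sketched below. It should be read as an assumption-laden illustration rather than the code used for this exercise: the data vector ys is made up, the seed is arbitrary, and mu is drawn conditionally on lambda from a normal distribution, as in the R code for Exercise 11.2, which gives mu the same marginal t distribution while also respecting the joint posterior.

```r
# Sketch: J draws from the posterior of gamma = mu / sigma = mu * sqrt(lambda)
# under the prior f(mu, lambda) proportional to 1/lambda.
set.seed(1)
ys <- rnorm(20, 10, 2)                 # made-up data standing in for y_s
n <- length(ys); ysbar <- mean(ys); ss <- sd(ys); J <- 1000
lamvec <- rgamma(J, (n - 1) / 2, (n - 1) * ss^2 / 2)   # lambda | D
muvec  <- rnorm(J, ysbar, 1 / sqrt(n * lamvec))        # mu | lambda, D
gamvec <- muvec * sqrt(lamvec)                          # gamma_j = mu_j * sqrt(lambda_j)
gambar  <- mean(gamvec)                                 # MC estimate of the posterior mean
gamcpdr <- quantile(gamvec, c(0.025, 0.975))            # MC estimate of the 95% CPDR
gammle  <- ysbar / ss                                   # the estimate called the MLE in the text
```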


The MC sample from $\gamma$'s posterior was used to calculate the point
estimate
$$\bar{\gamma} = \frac{1}{J}\sum_{j=1}^{J} \gamma_j = 5.925$$
(the MC estimate of $\gamma$'s posterior mean) and the interval estimate
I = (4.115, 7.963)
(formed by the empirical 0.025 and 0.975 quantiles of $\gamma_1, \ldots, \gamma_J$).

Figure 11.9 shows a histogram of the simulated values $\gamma_1, \ldots, \gamma_J$ overlaid
by an estimate of $\gamma$'s posterior density $f(\gamma \mid D)$. Also shown in the
figure are the Bayesian estimates (3 vertical lines), the MLE $\tilde{\gamma} = 6.011$,
and the true value of $\gamma$, namely $\gamma = \mu/\sigma = 10/2 = 5$. We see that the
true value of $\gamma$ lies in the Bayesian interval estimate, and the difference
between the Bayesian estimate and the true value is 5.925 - 5 = 0.925.
Likewise, the MLE is in 'error' by 6.011 - 5 = 1.011.

Figure 11.9 Inference on $\gamma$ based on a MC sample (J = 1,000)
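(b) The procedure in (a) was repeated so as to yield a total of K = 100
Bayesian estimates $\bar{\gamma}_1, \ldots, \bar{\gamma}_K$, as well as K Bayesian interval estimates
$I_1, \ldots, I_K$ and K MLEs $\tilde{\gamma}_1, \ldots, \tilde{\gamma}_K$.

From these results we estimated the model mean of the Bayesian
estimate $\bar{\gamma}$ by
$$\bar{\bar{\gamma}} = \frac{1}{K}\sum_{k=1}^{K} \bar{\gamma}_k = 5.2226,$$
with 95% CI (for that mean) of
$$\left( \bar{\bar{\gamma}} \pm 1.96 \sqrt{\frac{1}{K(K-1)}\sum_{k=1}^{K} (\bar{\gamma}_k - \bar{\bar{\gamma}})^2} \right) = (4.9986, 5.4466).$$

Hence we estimated the model bias of $\bar{\gamma}$ by $\bar{\bar{\gamma}} - \gamma = 0.2226$ with 95%
CI (-0.0014, 0.4466).

Likewise, we estimated the model mean of the MLE $\tilde{\gamma}$ by
$$\bar{\tilde{\gamma}} = \frac{1}{K}\sum_{k=1}^{K} \tilde{\gamma}_k = 5.298,$$
with 95% CI (for that mean) of
$$\left( \bar{\tilde{\gamma}} \pm 1.96 \sqrt{\frac{1}{K(K-1)}\sum_{k=1}^{K} (\tilde{\gamma}_k - \bar{\tilde{\gamma}})^2} \right) = (5.0705, 5.5255).$$

Hence we estimate the model bias of $\tilde{\gamma}$ by $\bar{\tilde{\gamma}} - \gamma = 0.298$ with 95% CI
(0.0705, 0.5255).

Thus we also estimate the relative model biases of $\bar{\gamma}$ and $\tilde{\gamma}$ by
$(\bar{\bar{\gamma}} - \gamma)/\gamma = 0.0445$ with 95% CI (-0.0003, 0.0893)
$(\bar{\tilde{\gamma}} - \gamma)/\gamma = 0.0596$ with 95% CI (0.0141, 0.1051).

Note: These could also be reported as the percentages (%):
$(\bar{\bar{\gamma}} - \gamma)/\gamma = 4.5$ with 95% CI (-0.03, 8.9)
$(\bar{\tilde{\gamma}} - \gamma)/\gamma = 6.0$ with 95% CI (1.4, 10.5).

Also, exactly 91 of the 100 Bayesian interval estimates $I_1, \ldots, I_K$ actually
contained the true value $\gamma = 5$.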
So we estimate the model coverage of the 95% CPDR estimate of $\gamma$
(based on a MC sample size of specifically J = 1,000) as 0.91, with
95% CI (for that coverage)
$$\left( 0.91 \pm 1.96 \sqrt{0.91(1 - 0.91)/100} \right) = (0.854, 0.966).$$
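
The interval for the coverage is the usual normal-approximation interval for a binomial proportion. A minimal sketch in R follows; the function name coverage_ci and its arguments are assumptions introduced here for illustration.

```r
# Sketch: normal-approximation CI for a coverage probability estimated as
# the proportion of K interval estimates that contain the true value.
coverage_ci <- function(hits, K, level = 0.95) {
  p <- hits / K
  half <- qnorm(1 - (1 - level) / 2) * sqrt(p * (1 - p) / K)
  c(estimate = p, lower = p - half, upper = p + half)
}

cov_b <- round(coverage_ci(91, 100), 3)     # 0.910 0.854 0.966, with K = 100
cov_c <- round(coverage_ci(4755, 5000), 3)  # 0.951 0.945 0.957, with K = 5,000
```

With K = 100 the interval is wide (about plus or minus 0.056), so an estimated coverage of 0.91 is not clear evidence against the nominal 95% level.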


Figure 11.10 shows a histogram of the K simulated values $\bar{\gamma}_1, \ldots, \bar{\gamma}_K$
and related quantities.

We see that the Bayesian inference appears to have slightly
outperformed the MLE as regards model bias.

Note that this applies in a very particular situation, namely one with
N = 100, n = 20, $\mu = 10$, $\sigma = 2$, and a MC estimation scheme as
described above with specifically J = 1,000.

Note: If we were to use a different common sample from each finite
population (e.g. a fixed set of n labels other than 1,...,n), or a
different sample each time, the results would be the same, subject to
Monte Carlo variation. This might not be the case in a situation where
the sampling mechanism is nonignorable or where there are covariate
values. But as a matter of form when calculating model-based properties,
we must condition on the sample being taken, i.e. on s.


Figure 11.10 Distribution of K = 100 estimates 
(c) Repeating (a) and (b) with K = 5,000, we obtained the following
results:
Estimate of model bias of $\bar{\gamma}$ is 0.1616 with 95% CI (0.1359, 0.1872)
Estimate of model bias of $\tilde{\gamma}$ is 0.2301 with 95% CI (0.2041, 0.2561)
Estimate of relative model bias of $\bar{\gamma}$ is 3.2 with 95% CI (2.7, 3.7) (%)
Estimate of relative model bias of $\tilde{\gamma}$ is 4.6 with 95% CI (4.1, 5.1) (%).

Exactly 4,755 of the 5,000 Bayesian interval estimates $I_1, \ldots, I_K$ actually
contained the true value $\gamma = 5$.

So we estimate the model coverage of the 95% CPDR estimate of $\gamma$
(based on a MC sample of size J = 1,000) as 4,755/5,000 = 0.951, with
95% CI (for that coverage),
$$\left( 0.951 \pm 1.96 \sqrt{0.951(1 - 0.951)/5{,}000} \right) = (0.945, 0.957).$$

From these results it appears that both the Bayesian and ML estimators
are indeed positively biased by several percent, with the Bayesian
estimator slightly outperforming the MLE.

It also appears that the model coverage of the Bayesian interval estimate
is very close to the nominal 95%.

Figure 11.11 shows a histogram of the 5,000 simulated Bayesian
estimates and related information. A detail in this figure is shown as
Figure 11.12.

Figure 11.11 Distribution of K = 5,000 estimates

Figure 11.12 Detail in Figure 11.11
(d) A finite population of size N = 100 from the $N(\mu = 10, \sigma^2 = 4)$
distribution was generated. The mean and standard deviation of
the 100 finite population values were $\bar{y} = 9.675$ and $s_y = 2.159$.

A histogram of the values is shown in Figure 11.13. The true value of
the ratio requiring inference was in this case calculated as
$$\psi = \frac{y_{(100)}}{(y_{(50)} + y_{(51)})/2} = 1.536,$$
where the denominator (the finite population median) equals 10.171.
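
For reference, the ratio of largest value to median can be computed in R as in the sketch below. The function name ratio_max_median and the toy population 1:100 are assumptions used only for illustration.

```r
# Ratio of the largest value to the median of a finite population vector.
# For even N, median() averages the two middle order statistics,
# i.e. (y_(50) + y_(51)) / 2 when N = 100.
ratio_max_median <- function(y) max(y) / median(y)

y <- 1:100                            # toy population, not the one in the text
psi_toy <- ratio_max_median(y)        # 100 / 50.5
ys_sorted <- sort(y)
psi_check <- ys_sorted[100] / ((ys_sorted[50] + ys_sorted[51]) / 2)
```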


Figure 11.13 Histogram of N = 100 finite population values

Then a sample of n = 20 values was taken from the finite
population. The sample mean and standard deviation of the sampled
values were $\bar{y}_s = 9.438$ and $s_s = 2.448$. A histogram of the sample
values is shown in Figure 11.14.

Figure 11.14 Histogram of n = 20 sample values

Then a MC sample of size J = 1,000 was taken from the joint posterior
distribution of $\mu$ and $\lambda = 1/\sigma^2$, i.e. from $f(\mu, \lambda \mid D)$ with $D = (s, y_s)$.
Hence a MC sample of size J was obtained from the predictive
distribution of $\psi$, namely $\psi_1, \ldots, \psi_J \sim \text{iid } f(\psi \mid D)$.

Note: As explained in previous exercises, this was done by doing the
following for each $j = 1, \ldots, J$:

• first sample $y_i^{(j)} \sim \text{iid } N(\mu_j, 1/\lambda_j)$, $i \in r$
• then form $y^{(j)} = (y_s, y_r^{(j)})$
• finally calculate $\psi_j$ from $(y_s, y_r^{(j)})$.
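
The three steps in the note can be sketched in R as follows. The sample ys is made up, the seed is arbitrary, and the posterior draws of (mu, lambda) follow the pattern of the R code for Exercise 11.2; none of the variable names are taken from the code for this exercise.

```r
# Sketch: J draws from the predictive distribution of
# psi = max(y) / median(y), given a sample ys from a population of size N.
set.seed(3)
N <- 100; n <- 20; m <- N - n; J <- 1000
ys <- rnorm(n, 10, 2)                  # made-up data standing in for y_s
lamvec <- rgamma(J, (n - 1) / 2, (n - 1) * var(ys) / 2)   # lambda | D
muvec  <- rnorm(J, mean(ys), 1 / sqrt(n * lamvec))        # mu | lambda, D
psivec <- numeric(J)
for (j in seq_len(J)) {
  yr <- rnorm(m, muvec[j], 1 / sqrt(lamvec[j]))   # nonsample values y_r^(j)
  y  <- c(ys, yr)                                 # completed population y^(j)
  psivec[j] <- max(y) / median(y)                 # psi_j
}
psibar  <- mean(psivec)                           # MC estimate of the predictive mean
psicpdr <- quantile(psivec, c(0.025, 0.975))      # MC estimate of the 95% CPDR
```

Because the observed values ys are kept fixed in every completed population, each simulated maximum is at least max(ys).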


The MC sample from $\psi$'s predictive distribution was used to calculate
the point estimate
$$\bar{\psi} = \frac{1}{J}\sum_{j=1}^{J} \psi_j = 1.715$$
(the MC estimate of $\psi$'s predictive mean) and the interval I = (1.456,
2.078) formed by the empirical 0.025 and 0.975 quantiles of $\psi_1, \ldots, \psi_J$.

Figure 11.15 shows a probability histogram of the simulated values
$\psi_1, \ldots, \psi_J$ overlaid by an estimate of $\psi$'s predictive density $f(\psi \mid D)$.

Also shown are the Bayesian estimates (represented by three vertical
lines), and the true value of $\psi$, which is 1.536 (represented by the dot).

We note that the true value of $\psi$ lies in the Bayesian interval estimate,
and the difference between the Bayesian estimate and the true value is
1.715 - 1.536 = 0.179.

Figure 11.15 Inference on $\psi$ based on a MC sample (J = 1,000)
(e) The procedure in (d) was repeated so as to yield a total of K = 100
Bayesian estimates $\hat{\psi}_1, \ldots, \hat{\psi}_K$ and K corresponding Bayesian interval
estimates $I_1, \ldots, I_K$. From these results we estimate the design mean of
the Bayesian predictive mean estimate $\hat{\psi}$ by

$$\bar{\psi} = \frac{1}{K} \sum_{k=1}^{K} \hat{\psi}_k = 1.6168,$$

with 95% CI (for that mean)

$$\left( \bar{\psi} \pm 1.96 \sqrt{\frac{1}{K} \cdot \frac{1}{K-1} \sum_{k=1}^{K} (\hat{\psi}_k - \bar{\psi})^2} \right) = (1.5962, 1.6374).$$

Hence we estimate the design bias of $\hat{\psi}$ by $\bar{\psi} - \psi = 0.0808$, with 95%
CI (0.0602, 0.1014). Thus we also estimate the relative design bias of $\hat{\psi}$
by $(\bar{\psi} - \psi)/\psi = 5.3$, with 95% CI (3.9, 6.6) (%).
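The design mean, bias and relative bias calculations above can be collected into one helper. The sketch below is illustrative only: the function name `design_summary` and its return structure are assumptions made here. The final two lines redo the bias arithmetic from the figures quoted in the text (mean 1.6168, CI (1.5962, 1.6374), true value 1.536).

```r
# Sketch: summarise K repeated estimates against a known true value.
# design_summary() is a hypothetical helper, not part of the book's code.
design_summary <- function(est, truth) {
  K    <- length(est)
  m    <- mean(est)                                          # estimated design mean
  ci   <- m + c(-1, 1) * qnorm(0.975) * sd(est) / sqrt(K)    # 95% CI for that mean
  bias <- c(m, ci) - truth                                   # bias estimate and its CI
  list(mean = m, ci = ci, bias = bias, rel_bias_pct = 100 * bias / truth)
}

# Bias arithmetic using the values reported in the text
reported <- c(1.6168, 1.5962, 1.6374)
psi_true <- 1.536
round(reported - psi_true, 4)                    # design bias and its 95% CI
round(100 * (reported - psi_true) / psi_true, 1) # relative design bias (%)
```

Applied to the vector of K = 100 estimates from the simulation, `design_summary(psibarvec, psi)` would return the same quantities directly.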
Also, 91 of the 100 Bayesian interval estimates $I_1, \ldots, I_K$ contained the
true value, $\psi = 1.536$. So we estimate the design coverage of the 95%
CPDR estimate of $\psi$ (based on a MC sample with size J = 1,000) as
0.91, with 95% CI

$$\left( 0.91 \pm 1.96 \sqrt{0.91(1 - 0.91)/100} \right) = (0.8539, 0.9661).$$
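As a quick check of this arithmetic, the coverage estimate and its normal-approximation interval can be computed in R. The helper name `cover_ci` is an assumption introduced here for illustration.

```r
# Sketch: estimated coverage and its 95% normal-approximation CI.
# cover_ci() is a hypothetical helper, not part of the book's code.
cover_ci <- function(hits, K) {
  p <- hits / K
  c(estimate = p, p + c(-1, 1) * qnorm(0.975) * sqrt(p * (1 - p) / K))
}

round(cover_ci(91, 100), 4)   # 0.9100 0.8539 0.9661, as reported in the text
```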


Figure 11.16 shows a probability histogram of the K simulated values
$\hat{\psi}_1, \ldots, \hat{\psi}_K$ and related quantities.

Figure 11.16 Distribution of K = 100 estimates of $\psi$
(f) Figure 11.17 is an analogue of Figure 11.16 but obtained by replacing
the Monte Carlo sample mean estimate $\hat{\psi} = (\psi_1 + \ldots + \psi_J)/J$ by the
empirical median of $\psi_1, \ldots, \psi_J$.

Likewise, Figure 11.18 is an analogue of Figure 11.16 but obtained by
replacing the posterior mean estimate by the empirical mode of
$\psi_1, \ldots, \psi_J$.

Note: The empirical mode was obtained using the R function density().
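A minimal sketch of how an empirical mode can be read off a kernel density estimate is given below. The helper name `emp_mode` and the simulated right-skewed sample are assumptions made here; `which.max()` is used in place of the comparison `den$y==max(den$y)`, which gives the same grid point when the maximum is unique.

```r
# Sketch: empirical mode of a Monte Carlo sample via a kernel density estimate.
# emp_mode() is a hypothetical helper, not part of the book's code.
emp_mode <- function(x) {
  den <- density(x)         # default Gaussian kernel and bandwidth
  den$x[which.max(den$y)]   # grid point at which the estimated density peaks
}

set.seed(1)
x <- rgamma(5000, shape = 3, rate = 1)   # hypothetical right-skewed sample
c(mean = mean(x), median = median(x), mode = emp_mode(x))
```

For a right-skewed sample like this one, the mode lies below the median, which lies below the mean. That is the same ordering seen among the three estimates of $\psi$ in this exercise.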


We see that the design bias of the empirical mode appears to be smaller
than that of the empirical median, which in turn is smaller than that of
the posterior mean. The relative biases of the Monte Carlo predictive mean,
median and mode estimates (based on a Monte Carlo sample size of
J = 1,000) are estimated as +5.3%, +3.8% and +1.4%.

Note: From Figure 11.15 in (d) we may have already guessed that the
posterior mode is better than the posterior mean as an estimate of $\psi$
(whose true value is 1.536, as shown by the dot in Figures 11.15-18).

Figure 11.17 Distribution of K = 100 estimates of $\psi$

Figure 11.18 Distribution of K = 100 estimates of $\psi$
R Code for Exercise 11.3

# (a)
X11(w=8,h=4); par(mfrow=c(1,1)); options(digits=4)

N=100; n=20; mu=10; sig=2; lam=1/sig^2; gam=mu/sig
set.seed(332); y=rnorm(N,mu,sig); # hist(y,prob=T)
hist(y, prob=T,xlab="value", xlim=c(0,20), ylim=c(0,0.4), breaks=seq(0,20,0.5),
  main=" ")
lines(seq(0,20,0.1),dnorm(seq(0,20,0.1),mu,sig),lwd=3)

ys=y[1:n]
hist(ys, prob=T,xlab="value", xlim=c(0,20), ylim=c(0,0.4), breaks=seq(0,20,0.5),
  main="")
lines(seq(0,20,0.1),dnorm(seq(0,20,0.1),mu,sig),lwd=3)

ysbar=mean(ys); sys=sd(ys); gammle=ysbar/sys
ybar=mean(y); sy=sd(y); ygam=ybar/sy; c(ybar,sy,ygam) # 9.932 1.907 5.207
c(lam,ysbar,sys, gam, gammle) # 0.250 10.516 1.749 5.000 6.011

J=1000; set.seed(171);
lamv=rgamma(J,(n-1)/2,sys^2*(n-1)/2); muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
gamv=muv*sqrt(lamv)
gambar=mean(gamv); gamint=quantile(gamv,c(0.025,0.975))
c(gambar, gamint) # 5.925 4.115 7.963

hist(gamv, prob=T,xlab="gamma", xlim=c(2,10), ylim=c(0,0.5),
  breaks=seq(0,12,0.25), main=" ")
abline(v=c(gambar, gamint),lwd=3); lines(density(gamv),lwd=3)
points(c(gam,gammle),c(0,0),pch=c(16,1))
legend(7,0.5,c("True value of gamma","MLE of gamma"),
  pch=c(16,1),bg="white")


# (b) Follows on from (a)

K = 100; J=1000; gambarvec=rep(NA,K); gammlevec=rep(NA,K);
gamlie=rep(0,K);
set.seed(143); for(k in 1:K){
  y=rnorm(N,mu,sig); s=1:n; ys=y[s]; ysbar=mean(ys); sys=sd(ys)
  lamv=rgamma(J,(n-1)/2,sys^2*(n-1)/2);
  muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
  gamv=muv*sqrt(lamv); gambar=mean(gamv);
  gammlevec[k]=ysbar/sys
  gamint=quantile(gamv,c(0.025,0.975)); gambarvec[k]=gambar
  if((gamint[1]<=gam)&&(gam<=gamint[2])) gamlie[k]=1 }

Eest=mean(gambarvec);
Eci=Eest+c(-1,1)*qnorm(0.975)*sd(gambarvec)/sqrt(K)
Cest=mean(gamlie); Cci=Cest+c(-1,1)*qnorm(0.975)*sqrt(Cest*(1-Cest)/K)
c(Eest,Eci,Cest,Cci) # 5.2226 4.9986 5.4466 0.9100 0.8539 0.9661
Emleest=mean(gammlevec)
Emleci=Emleest+c(-1,1)*qnorm(0.975)*sd(gammlevec)/sqrt(K)
c(Emleest,Emleci) # 5.298 5.070 5.526

Biasest=Eest-gam; Biasci=Eci-gam
Biasmleest=Emleest-gam; Biasmleci=Emleci-gam
c(Biasest,Biasci, Biasmleest,Biasmleci)
# 0.222583 -0.001418 0.446583 0.298019 0.070493 0.525544
c(Biasest,Biasci, Biasmleest,Biasmleci)/gam
# 0.0445165 -0.0002836 0.0893166 0.0596037 0.0140986 0.1051088

# hist(gambarvec,prob=T)
hist(gambarvec,prob=T,xlab="gammabar, gammahat", xlim=c(2,12),
  ylim=c(0,0.6), breaks=seq(0,12,0.5), main= "")
abline(v=c(Eest,Eci), lty=1, lwd=3); abline(v=c(Emleest,Emleci), lty=2, lwd=3)
lines(density(gambarvec),lty=1,lwd=3); lines(density(gammlevec), lty=2,lwd=3)
points(gam,0,pch=16)
legend(6.5,0.6,c("Bayesian estimates \n(MC with J=1000)", "ML estimates"),
  lty=c(1,2), lwd=c(3,3))


# (c)

K = 5000; J=1000; gambarvec=rep(NA,K);
gammlevec=rep(NA,K); gamlie=rep(0,K);
set.seed(213); for(k in 1:K){ # Takes a few seconds
  y=rnorm(N,mu,sig); s=1:n; ys=y[s]; ysbar=mean(ys); sys=sd(ys)
  lamv=rgamma(J,(n-1)/2,sys^2*(n-1)/2);
  muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
  gamv=muv*sqrt(lamv);
  gambar=mean(gamv); gammlevec[k]=ysbar/sys
  gamint=quantile(gamv,c(0.025,0.975)); gambarvec[k]=gambar
  if((gamint[1]<=gam)&&(gam<=gamint[2])) gamlie[k]=1 }

Eest=mean(gambarvec);
Eci=Eest+c(-1,1)*qnorm(0.975)*sd(gambarvec)/sqrt(K)
Cest=mean(gamlie); Cci=Cest+c(-1,1)*qnorm(0.975)*sqrt(Cest*(1-Cest)/K)
c(Eest,Eci,Cest,Cci) # 5.162 5.136 5.187 0.951 0.945 0.957
Emleest=mean(gammlevec)
Emleci=Emleest+c(-1,1)*qnorm(0.975)*sd(gammlevec)/sqrt(K)
c(Emleest,Emleci) # 5.230 5.204 5.256

Biasest=Eest-gam; Biasci=Eci-gam
Biasmleest=Emleest-gam; Biasmleci=Emleci-gam
c(Biasest,Biasci, Biasmleest,Biasmleci)
# 0.1616 0.1359 0.1872 0.2301 0.2041 0.2561
c(Biasest,Biasci, Biasmleest,Biasmleci)/gam
# 0.03231 0.02718 0.03745 0.04602 0.04081 0.05122

# hist(gambarvec,prob=T)
hist(gambarvec,prob=T,xlab="gammabar, gammahat", xlim=c(2,12),
  ylim=c(0,0.6), breaks=seq(2,12,0.25), main= "")
abline(v=c(Eest,Eci), lty=1, lwd=3); abline(v=c(Emleest,Emleci), lty=2, lwd=3)
lines(density(gambarvec),lty=1,lwd=3); lines(density(gammlevec), lty=2,lwd=3)
points(gam,0,pch=16)
legend(6,0.6,c("Bayesian estimates \n(MC with J=1000)", "ML estimates"),
  lty=c(1,2), lwd=c(3,3))

hist(gambarvec,prob=T,xlab="gammabar, gammahat", xlim=c(4.5,6),
  ylim=c(0,0.6), breaks=seq(2,12,0.25), main= "")
abline(v=c(Eest,Eci), lty=1, lwd=3); abline(v=c(Emleest,Emleci), lty=2, lwd=3)
lines(density(gambarvec),lty=1,lwd=3); lines(density(gammlevec), lty=2,lwd=3)
points(gam,0,pch=16)
# (d)

psifun=function(y){ max(y)/median(y) } # Function for the quantity of interest
N=100; n=20; mu=10; sig=2; set.seed(119); y=rnorm(N,mu,sig)
ybar=mean(y); sy=sd(y); psi=psifun(y=y)
c(ybar,sy,min(y),max(y), median(y), psi)
# 9.675 2.159 3.678 15.622 10.171 1.536

hist(y,prob=T,xlab="value", xlim=c(0,20), ylim=c(0,0.4), breaks=seq(0,20,0.5),
  main="")
lines(seq(0,20,0.1),dnorm(seq(0,20,0.1),mu,sig),lwd=3)

set.seed(421); ys=sample(y,n)
ys=y[s]; ysbar=mean(ys); sy=sd(ys); sy2=var(ys) # s is carried over from (c)
c(ysbar,sy, sy2) # 9.438 2.448 5.994

hist(ys,prob=T,xlab="value", xlim=c(0,20), ylim=c(0,0.4), breaks=seq(0,20,0.5),
  main="")
lines(seq(0,20,0.1),dnorm(seq(0,20,0.1),mu,sig),lwd=3)

set.seed(323); J=1000;
lamv=rgamma(J,(n-1)/2,sy2*(n-1)/2); muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
psiv=rep(NA,J);
# Note: muv and lamv are passed as length-J vectors, so rnorm() uses only their
# first N-n elements; muv[j] and lamv[j] would match the steps described in (d)
for(j in 1:J){ yrsim=rnorm(N-n,muv,1/sqrt(lamv)); ysim=c(ys,yrsim);
  psiv[j]=psifun(y=ysim) }

psibar=mean(psiv); psiint=quantile(psiv,c(0.025,0.975))
c(psibar,psiint) # 1.715 1.456 2.078
summary(psiv)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   1.37    1.60    1.69    1.72    1.81    2.34

# hist(psiv, prob=T)
hist(psiv, prob=T,xlab="psi", xlim=c(1.3,2.4), ylim=c(0,4),breaks=seq(1,2.5,0.05),
  main="")
abline(v=c(psibar, psiint),lwd=3); den=density(psiv)
lines(den,lwd=3); points(psi,0, pch=16)

psimedian=median(psiv)
psimode=den$x[den$y==max(den$y)]
c(psibar,psimedian,psimode) # 1.715 1.688 1.659
# (e) Follows on from (d)

K = 100; J=1000; psibarvec=rep(NA,K); LBvec= psibarvec; UBvec=LBvec;
alp=0.05
set.seed(411);

date() #
for(k in 1:K){
  ys=sample(y,n); ysbar=mean(ys); sy2=var(ys)
  lamv=rgamma(J,(n-1)/2,sy2*(n-1)/2);
  muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
  psiv=rep(NA,J); for(j in 1:J){
    yrsim=rnorm(N-n,muv,1/sqrt(lamv))
    ysim=c(ys, yrsim)
    psiv[j]=psifun(y=ysim)
  }
  psibarvec[k] = mean(psiv);
  LBvec[k]=quantile(psiv,alp/2); UBvec[k]=quantile(psiv,1-alp/2)
}
date() # Simulation with K=100 & J=1000 takes 12 seconds

ct=0; for(k in 1:K) if((LBvec[k]<=psi)&&(psi<=UBvec[k])) ct=ct+1

# hist(psibarvec, prob=T)
hist(psibarvec,prob=T,xlab="psibar", xlim=c(1.2,2), ylim=c(0,6.5),
  breaks=seq(1.2,2,0.025), main= "")
points(psi,0,pch=16)

# Characteristics of posterior mean estimate --------------
Eest=mean(psibarvec); Eci=Eest+c(-1,1)*qnorm(0.975)*sd(psibarvec)/sqrt(K)
Cest=ct/K; Cci=Cest+c(-1,1)*qnorm(0.975)*sqrt(Cest*(1-Cest)/K)
c(Eest,Eci,Cest,Cci) # 1.6168 1.5962 1.6374 0.9100 0.8539 0.9661
Biasest=Eest-psi; Biasci=Eci-psi; c(Biasest,Biasci) # 0.08084 0.06024 0.10144
c(Biasest,Biasci)/psi # 0.05263 0.03922 0.06604
abline(v=c(Eest,Eci), lty=1, lwd=3); lines(density(psibarvec),lty=1,lwd=3)


# (f) Follows on from (e)

K = 100; J=1000; LBvec= rep(NA,K); UBvec=LBvec; alp=0.05
psimodevec= LBvec; psimedianvec= LBvec; set.seed(411);

date() #
for(k in 1:K){
  ys=sample(y,n); ysbar=mean(ys); sy2=var(ys)
  lamv=rgamma(J,(n-1)/2,sy2*(n-1)/2);
  muv=rnorm(J,ysbar,1/sqrt((n*lamv)))
  psiv=rep(NA,J); for(j in 1:J){
    yrsim=rnorm(N-n,muv,1/sqrt(lamv))
    ysim=c(ys,yrsim)
    psiv[j]=psifun(y=ysim)
  }
  psimedianvec[k] = median(psiv)
  den=density(psiv); psimodevec[k]=den$x[den$y==max(den$y)]
  LBvec[k]=quantile(psiv,alp/2); UBvec[k]=quantile(psiv,1-alp/2)
}
date() # Simulation with K=100 & J=1000 takes 12 seconds
ct=0; for(k in 1:K) if((LBvec[k]<=psi)&&(psi<=UBvec[k])) ct=ct+1


# hist(psimedianvec, prob=T)
hist(psimedianvec,prob=T,xlab="psimedian", xlim=c(1.2,2),
  ylim=c(0,6),breaks=seq(1.2,2,0.025), main= "")
points(psi,0,pch=16)

# Characteristics of posterior median estimate -----------------
# Note: the CI below reuses sd(psibarvec) from (e), as in the reported output
Eest=mean(psimedianvec);
Eci=Eest+c(-1,1)*qnorm(0.975)*sd(psibarvec)/sqrt(K)
Cest=ct/K; Cci=Cest+c(-1,1)*qnorm(0.975)*sqrt(Cest*(1-Cest)/K)
c(Eest,Eci,Cest,Cci) # 1.5947 1.5741 1.6153 0.9100 0.8539 0.9661
Biasest=Eest-psi; Biasci=Eci-psi; c(Biasest,Biasci) # 0.05873 0.03813 0.07934
c(Biasest,Biasci)/psi # 0.03824 0.02483 0.05165
abline(v=c(Eest,Eci), lty=1, lwd=3); lines(density(psimedianvec),lty=1,lwd=3)


# hist(psimodevec,prob=T)
hist(psimodevec,prob=T,xlab="psimode", xlim=c(1.2,2),
  ylim=c(0,6),breaks=seq(1.2,2,0.025), main= "")
points(psi,0,pch=16)

# Characteristics of posterior mode estimate --------------------
# Note: the CI below reuses sd(psibarvec) from (e), as in the reported output
Eest=mean(psimodevec); Eci=Eest+c(-1,1)*qnorm(0.975)*sd(psibarvec)/sqrt(K)
Cest=ct/K; Cci=Cest+c(-1,1)*qnorm(0.975)*sqrt(Cest*(1-Cest)/K)
c(Eest,Eci,Cest,Cci) # 1.5579 1.5373 1.5785 0.9100 0.8539 0.9661
Biasest=Eest-psi; Biasci=Eci-psi; c(Biasest, Biasci)
# 0.021933 0.001332 0.042534
c(Biasest,Biasci)/psi # 0.0142795 0.0008672 0.0276917
abline(v=c(Eest,Eci), lty=1, lwd=3); lines(density(psimodevec),lty=1,lwd=3)
12.1 Review of sampling mechanisms

We have already discussed the topic of ignorable and nonignorable
sampling in the context of Bayesian finite population models. To be
definite, let us now focus on the model defined by:

$f(s \mid y, \theta)$ (the probability of obtaining sample s for given
values of y and $\theta$)
$f(y \mid \theta)$ (the model density of the finite population vector)
$f(\theta)$ (the prior density of the parameter),

where the data is $D = (s, y_s)$ and the quantity of interest is some
functional $\psi = g(\theta, y)$, e.g. a function of two components of $\theta$ or a
function of y only, etc.


We say that the sampling mechanism is ignorable if

$$f(\psi \mid s, y_s) = f(\psi \mid y_s)$$

for all values of $\psi$, where s is fixed at its observed value, or
equivalently, if the posterior distribution of $\psi$ is exactly the same when
it is calculated solely on the basis of the ‘reduced model’ as given by:

$f(y \mid \theta)$ (same as before)
$f(\theta)$ (same as before),

that is, with $f(s \mid y, \theta)$ effectively being ‘ignored’. Otherwise, we say
that the sampling mechanism is nonignorable (or biased).


Equivalently, the sampling mechanism is ignorable if

$$f(\psi \mid s, y_s) = f(\psi \mid y_s)$$

for all $\psi$, and the sampling mechanism is nonignorable if

$$f(\psi \mid s, y_s) \neq f(\psi \mid y_s)$$

for at least one value of $\psi$.
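To make the definition concrete, the following short derivation sketches one sufficient condition for ignorability. It is added here as an illustration, and the function $h$ is notation introduced for this sketch only. Suppose the sampling density depends on neither $\theta$ nor the nonsample values, so that $f(s \mid y, \theta) = h(s, y_s)$. Then, with s fixed at its observed value,

```latex
\begin{align*}
f(\theta, y_r \mid s, y_s)
  &\propto f(\theta)\, f(y_s, y_r \mid \theta)\, f(s \mid y, \theta) \\
  &=       f(\theta)\, f(y_s, y_r \mid \theta)\, h(s, y_s) \\
  &\propto f(\theta)\, f(y_s, y_r \mid \theta) \\
  &\propto f(\theta, y_r \mid y_s).
\end{align*}
```

Since $\psi = g(\theta, y)$ is a function of $(\theta, y_r)$ once $y_s$ is known, it follows that $f(\psi \mid s, y_s) = f(\psi \mid y_s)$ for all $\psi$. SRSWOR is a special case, with $h(s, y_s)$ constant.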


Recall that in some situations, whether the sampling mechanism is
ignorable may depend on which particular units happen to be sampled.
For example, if $f(s \mid y, \theta)$ is a function of only N, n and $y_3$ (say), then
(typically) the sampling mechanism is ignorable if and only if unit 3 is
sampled (and thereby observed).


Also, recall that analogous definitions apply if the sampling mechanism
is alternatively specified in terms of

$f(I \mid y, \theta)$

or in terms of

$f(L \mid y, \theta)$,

rather than in terms of

$f(s \mid y, \theta)$.

Here, as previously, $I = (I_1, \ldots, I_N)$ denotes the vector of inclusion
counters, i.e. the numbers of times units 1,...,N are sampled (possibly
more than once in the case of sampling with replacement), and
$L = (L_1, \ldots, L_n)$ is the vector of the labels of the units sampled in the
temporal order in which they are sampled.
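The relationship between the label vector L, the inclusion counters I and the sample s can be illustrated with a small simulation. This is a sketch with assumed values of N and n, using simple random sampling with replacement purely as an example.

```r
# Sketch: label vector L, inclusion counters I and sample s (values are illustrative)
set.seed(1)
N <- 10; n <- 4
L <- sample(1:N, n, replace = TRUE)   # labels in the order drawn (with replacement)
I <- tabulate(L, nbins = N)           # I[i] = number of times unit i was drawn
s <- which(I > 0)                     # distinct sampled labels, in increasing order
rbind(unit = 1:N, I = I)
```

Here L retains the order of selection, I discards the order but keeps the multiplicities, and s keeps only which units were selected.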


12.2 Nonresponse mechanisms 


An issue related to nonignorable sampling is nonignorable nonresponse. 
Once a sample has been taken, some of the units may then fail to 
respond. This may be for whatever reason, but the underlying issue is 
that the values of the nonresponding units will then be unobserved, with 
possibly serious consequences to the resulting inference. 


This issue can be addressed by introducing another variable and level
into the modelling equation. Let $R_i$ denote the ith response indicator,
meaning the indicator variable for the ith population unit responding.

Thus $R_i = 1$ if unit i responds, and $R_i = 0$ otherwise ($i = 1, \ldots, N$).

Now let $R = (R_1, \ldots, R_N)$ (or the transpose of this) be called the
population response vector, and likewise, define:

$R_s = (R_{s_1}, \ldots, R_{s_n})$ as the sample response vector
$R_r = (R_{r_1}, \ldots, R_{r_{N-n}})$ as the nonsample response vector.
With these definitions we may now augment our ‘base model’ above
with a new level in the hierarchy, typically in between y and s, as
follows:

$f(s \mid R, y, \theta)$ (the probability of obtaining sample s for given
values of R, y and $\theta$)
$f(R \mid y, \theta)$ (the probability of units responding as indicated
by R, given y and $\theta$)
$f(y \mid \theta)$ (same as before)
$f(\theta)$ (same as before). (12.1)


Note 1: This general formulation, with $f(s \mid R, y, \theta)$ a function of R,
means that which units are sampled could potentially depend on which
units would respond if sampled. However, typically it will be
reasonable to assume that the sampling and response mechanisms are
independent in the model, meaning that $f(s \mid R, y, \theta) = f(s \mid y, \theta)$.


Note 2: The statistical literature contains many different and 
sometimes inconsistent treatments of nonignorable nonresponse. For a 
review of the term ‘missing at random’, which relates to but does not 
feature in the exposition here, see Seaman et al. (2013). 


In the context of this model, let

$$n_o = R_{s_1} + \ldots + R_{s_n} = 1'_n R_s$$

be the number of units in the sample that respond (have a value that is
observed), and let

$$n_u = n - n_o$$

be the number of units in the sample that do not respond (have a value
that is unobserved).

Then define

$$o = (o_1, \ldots, o_{n_o})$$

as the observed vector (the vector of the labels of the units sampled and
observed), and define

$$u = (u_1, \ldots, u_{n_u})$$

as the unobserved vector (the vector of the labels of the units sampled
and unobserved).
Note: In each of these vectors, the values (labels) are assumed to be in
increasing order.

Then define the observed sample vector as

$$y_o = (y_{o_1}, \ldots, y_{o_{n_o}})$$

and the unobserved sample vector as

$$y_u = (y_{u_1}, \ldots, y_{u_{n_u}}).$$

With these definitions, the data has the general form

$$D = (s, R_s, y_o)$$

and also the quantity of interest has the general form

$$\psi = g(\theta, y, R).$$
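The bookkeeping behind these definitions is easy to sketch in R. The population values, response indicators and sizes below are hypothetical and chosen only for illustration. The indexing `s[Rs == 1]` mirrors the construction used in the R code for Exercise 12.2.

```r
# Sketch: building o, u, y_o and the data D from s and R_s (all values hypothetical)
set.seed(1)
N <- 12; n <- 6
y  <- round(rnorm(N, 10, 2), 1)       # finite population values
R  <- rbinom(N, 1, 0.6)               # population response vector
s  <- sort(sample(1:N, n))            # sample labels, in increasing order
Rs <- R[s]                            # sample response vector
o  <- s[Rs == 1]                      # labels of sampled units that respond
u  <- s[Rs == 0]                      # labels of sampled units that do not respond
no <- sum(Rs); nu <- n - no           # numbers of observed and unobserved units
yo <- y[o]                            # observed sample vector
D  <- list(s = s, Rs = Rs, yo = yo)   # the data D = (s, R_s, y_o)
D
```

Note that `y[u]` and `y[-s]` exist in the simulation but are deliberately left out of D, since they are not available to the analyst.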


Note 1: The function g defining $\psi$ takes into account the possibility that
there may be interest in whether some of the nonsampled units would
have responded had they been sampled.

Note 2: As mentioned previously, it is often convenient to re-label the
N finite population values in such a way that

$$y = (y_s, y_r) = (y_o, y_u, y_r) = ((y_1, \ldots, y_{n_o}), (y_{n_o+1}, \ldots, y_n), (y_{n+1}, \ldots, y_N)) = (y_1, \ldots, y_N).$$


In the context of the general four-level Bayesian finite population model
given by (12.1) above (which involves s, R, y and $\theta$), we may make the
following definitions:

* The sampling mechanism is ignorable if

$$f(\psi \mid s, R_s, y_o) = f(\psi \mid R_s, y_o) \quad \forall \psi$$

with s fixed at its observed value (note that o is a function of s
and $R_s$); otherwise the sampling mechanism is nonignorable.

* The response mechanism is ignorable if

$$f(\psi \mid s, R_s, y_o) = f(\psi \mid s, y_o) \quad \forall \psi$$

with o fixed at its observed value; otherwise the response
mechanism is nonignorable.
These two basic definitions then lead to four general cases, defined as
follows:

* The sampling mechanism and response mechanism are both
ignorable if

$$f(\psi \mid s, R_s, y_o) = f(\psi \mid y_o) \quad \forall \psi$$

with o fixed at its observed value.

* The sampling mechanism is ignorable and the response
mechanism is nonignorable if

$$f(\psi \mid s, R_s, y_o) = f(\psi \mid R_s, y_o) \quad \forall \psi$$

with s fixed at its observed value, and

$$f(\psi \mid R_s, y_o) \neq f(\psi \mid y_o)$$

for at least one value of $\psi$.

* The response mechanism is ignorable and the sampling
mechanism is nonignorable if

$$f(\psi \mid s, R_s, y_o) = f(\psi \mid s, y_o) \quad \forall \psi$$

with o fixed at its observed value, and

$$f(\psi \mid s, y_o) \neq f(\psi \mid y_o)$$

for at least one value of $\psi$.

* The sampling mechanism and response mechanism are both
nonignorable if

$$f(\psi \mid s, R_s, y_o) \neq f(\psi \mid R_s, y_o)$$

for at least one value of $\psi$, and

$$f(\psi \mid s, R_s, y_o) \neq f(\psi \mid s, y_o)$$

for at least one value of $\psi$.
Exercise 12.1 A model with sampling and response
mechanisms that are both ignorable

Consider a Bayesian finite population model defined by:

$f(s \mid R, y, \theta)$
$f(R \mid y, \theta)$
$f(y \mid \theta)$
$f(\theta)$,

where the data is

$$D = (s, R_s, y_o)$$

and the quantity of interest is

$$\psi = g(\theta, y, R) = 1'_N y = \sum_{i=1}^{N} y_i = y_T \quad \text{(finite population total)}.$$

Suppose that in this context:

* the sample of n units is taken from the N in the population via
SRSWOR

* each unit in the population has the same probability of response,
$\pi$

* the population values in the model are iid, each with a
distribution which depends only on a single parameter $\mu$

* the model parameter vector is $\theta = (\mu, \pi)$, with $\mu \perp \pi$
(thus the two model parameters are independent, a priori).

Show that the sampling mechanism and response mechanism are both
ignorable, and that this is true for all possible values of the data.
Solution to Exercise 12.1

Observe that for all s, R, y and $\theta$:

$$f(s \mid R, y, \theta) = f(s) = \binom{N}{n}^{-1}$$

$$f(R \mid y, \theta) = f(R \mid \pi) = \prod_{i=1}^{N} \pi^{R_i} (1 - \pi)^{1 - R_i}$$

$$y_T = y_{oT} + y_{uT} + y_{rT},$$

where:
$y_{oT} = 1'_{n_o} y_o = \sum_{i \in o} y_i$ is the total of the observed sample values
$y_{uT} = 1'_{n_u} y_u = \sum_{i \in u} y_i$ is the total of the unobserved sample values
$y_{rT} = 1'_{N-n} y_r = \sum_{i \in r} y_i$ is the total of the nonsample values.

Note: Here, $1_{n_o}$ denotes a column vector of $n_o$ ones, etc.

Consequently, the relevant predictive density of the quantity of interest,
namely

$$f(\psi \mid D) = f(y_T \mid s, R_s, y_o),$$

is derived from the joint predictive density of all unobserved and
nonsampled values, namely

$$f(y_u, y_r \mid s, R_s, y_o).$$

We will now proceed to show that

$$f(y_u, y_r \mid s, R_s, y_o) = f(y_u, y_r \mid y_o)$$

with o fixed at its observed value, and that this is true for all possible
values of $y_u$, $y_r$, s, $R_s$ and $y_o$.

If this can be shown then also

$$f(y_T \mid s, R_s, y_o) = f(y_T \mid y_o),$$

for all possible values of $y_T$, s, $R_s$ and $y_o$.

It will thereby be established that the sampling mechanism and response
mechanism are both ignorable, and that this is true for all possible values
of the data $D = (s, R_s, y_o)$.
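Observe that for any $y_u$, $y_r$, s, $R_s$ and $y_o$, it is true that

$$f(y_u, y_r \mid s, R_s, y_o) \propto f(y_u, y_r, s, R_s, y_o)$$

$$= \sum_{R_r} \int\!\!\int f(y_u, y_r, s, R_s, y_o, R_r, \mu, \pi) \, d\mu \, d\pi$$

$$= \sum_{R_r} \int\!\!\int f(\mu) f(\pi) f(y_o \mid \mu) f(y_u, y_r \mid \mu) f(R_s \mid \pi) f(R_r \mid \pi) f(s) \, d\mu \, d\pi$$

$$= f(s) \times \left[ \int f(\mu) f(y_o \mid \mu) f(y_u, y_r \mid \mu) \, d\mu \right] \times \left[ \int f(\pi) f(R_s \mid \pi) \left\{ \sum_{R_r} f(R_r \mid \pi) \right\} d\pi \right],$$

where the last bracket equals $\int f(\pi, R_s) \times 1 \, d\pi = f(R_s)$,

$$\propto 1 \times \int \frac{f(\mu) f(y_o \mid \mu)}{f(y_o)} f(y_u, y_r \mid \mu) \, d\mu \times 1$$

$$= \int f(y_u, y_r \mid \mu, y_o) f(\mu \mid y_o) \, d\mu$$

$$= \int f(y_u, y_r, \mu \mid y_o) \, d\mu$$

$$= f(y_u, y_r \mid y_o).$$

That is,

$$f(y_u, y_r \mid s, R_s, y_o) = f(y_u, y_r \mid y_o),$$

as required.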
Exercise 12.2 An ignorable sampling mechanism with a
nonignorable response mechanism

A finite population consists of N = 500 values that are modelled as
normally distributed with unknown mean $\mu$ and unknown variance
$\sigma^2 = 1/\lambda$. A sample of size n = 100 is taken from this population via
SRSWOR. We find that only $n_o = 34$ values are observed, with values:

12.57, 13.35, 11.47, 14.81, 13.25, 14.09, 11.55, 11.32, 13.2, 11.28,
9.7, 12.18, 11.49, 10.52, 9.93, 11.84, 12.2, 10.57, 11.9, 14.75,
10.34, 14.37, 12.13, 8.56, 11.91, 11.79, 11.45, 14.98, 10.57, 12.28,
9.91, 10.94, 13.28, 11.43.

(a) Assuming that the response mechanism is ignorable, estimate the
finite population mean.

(b) A follow-up sample of size $n_f = 15$ is taken from the $n_u = 66$
non-responding units via SRSWOR, and these $n_f$ units are observed (by
‘force’), yielding the values:

5.4, 9.41, 7.03, 8.88, 11.47, 7, 9.44, 8.58, 9.27, 8.18,
8.62, 8.73, 7.33, 9.81, 9.88.

Thus there remain $n - n_o - n_f = n_u - n_f = 51$ nonresponding sample units
with unknown values.

Assuming that the response mechanism is ignorable, use all of the
available data to re-estimate the finite population mean.

(c) Repeat (b) but using a suitable Bayesian model which takes into
account the response mechanism and appropriately incorporates it into
the inferential procedure.


Solution to Exercise 12.2

(a) We estimate $\bar{y}$ by the average of the $n_o = 34$ observed values, which
is $\bar{y}_o = 11.94$. The sample standard deviation of these $n_o$ values is equal
to $s_o = 1.552$. So a 95% CPDR for $\bar{y}$ is

$$\left( \bar{y}_o \pm t_{0.025}(n_o - 1) \frac{s_o}{\sqrt{n_o}} \sqrt{1 - \frac{n_o}{N}} \right) = (11.42, 12.46).$$
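The interval in (a) can be reproduced from the listed data with a few lines of R. This is a sketch: the helper name `cpdr_mean` is an assumption introduced here, and the data are the 34 rounded values printed in the exercise, so results may differ from the text in the last digit.

```r
# Sketch: CPDR for the finite population mean, treating nonresponse as ignorable.
# cpdr_mean() is a hypothetical helper, not part of the book's code.
yo <- c(12.57, 13.35, 11.47, 14.81, 13.25, 14.09, 11.55, 11.32, 13.2, 11.28,
        9.7, 12.18, 11.49, 10.52, 9.93, 11.84, 12.2, 10.57, 11.9, 14.75,
        10.34, 14.37, 12.13, 8.56, 11.91, 11.79, 11.45, 14.98, 10.57, 12.28,
        9.91, 10.94, 13.28, 11.43)
N <- 500

cpdr_mean <- function(x, N, level = 0.95) {
  n  <- length(x)
  # t quantile times standard error, with the finite population correction
  hw <- qt(1 - (1 - level) / 2, n - 1) * sd(x) / sqrt(n) * sqrt(1 - n / N)
  c(estimate = mean(x), lower = mean(x) - hw, upper = mean(x) + hw)
}

res <- cpdr_mean(yo, N)
round(res, 2)
```

The same helper applies unchanged to any other vector of observed values, since only the data vector and N enter the calculation.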
(b) We estimate $\bar{y}$ by the average of all $n_{of} = n_o + n_f = 34 + 15 = 49$
observed values, which is equal to $\bar{y}_{of} = 10.92$. The sample standard
deviation of these $n_{of}$ values is $s_{of} = 2.168$. So a 95% CPDR for $\bar{y}$ is

$$\left( \bar{y}_{of} \pm t_{0.025}(n_{of} - 1) \frac{s_{of}}{\sqrt{n_{of}}} \sqrt{1 - \frac{n_{of}}{N}} \right) = (10.33, 11.51).$$

(c) Figure 12.1 shows histograms of the $n_o = 34$ initially observed values
and the $n_f = 15$ follow-up values, respectively. We see that the ‘forced’
follow-up values, from units which initially failed to respond, seem to be
smaller on average than the values of the units which responded. This suggests a
biased or nonignorable nonresponse mechanism whereby units with
large values are more likely to respond than units with small values.

Figure 12.1 Initially observed and follow-up sample values
One way (amongst several) to model such a response mechanism is via
the formulation

$$(R_i \mid y, \mu, \lambda) \sim \perp \text{Bernoulli}(p_i), \quad i = 1, \ldots, N,$$

where

$$\log\left( \frac{p_i}{1 - p_i} \right) = a + b y_i$$

is the logit of the probability of unit i responding.




Chapter 12: Biased Sampling and Nonresponse 


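Noting that the sampling mechanism is ignorable, and that the response
mechanism would be ignorable if all n sample values were known, we
posit a suitable Bayesian model as follows:

$$(\bar{y} \mid y_s, R_s, y_{rT}, \mu, \lambda) \sim (\bar{y} \mid y_s, y_{rT}) = \frac{1}{N} (y_{sT} + y_{rT})$$

$$(y_{rT} \mid R_s, y_s, \mu, \lambda) \sim (y_{rT} \mid \mu, \lambda) \sim N\left( (N - n)\mu, \frac{N - n}{\lambda} \right)$$

$$f(R_s \mid y_s, \mu, \lambda) = f(R_s \mid y_s) = \prod_{i \in s} p_i^{R_i} (1 - p_i)^{1 - R_i}, \quad \text{where } p_i = \frac{1}{1 + \exp\{-(a + b y_i)\}}$$

$$f(y_s \mid \mu, \lambda) = \prod_{i \in s} \sqrt{\frac{\lambda}{2\pi}} \exp\left\{ -\frac{\lambda}{2} (y_i - \mu)^2 \right\}$$

$$f(\mu, \lambda) \propto 1/\lambda, \quad \mu \in \Re, \; \lambda > 0.$$

Note: There is no need to include the nonsample response vector $R_r$ in
the model.

Let $m = s - o - f = u - f$ be the vector of labels for the units which are
sampled but still ‘missing’ after the follow-up sample has been
observed.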


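Then the joint posterior/predictive density of all the relevant unknowns
in the model may be written

$$f(y_{rT}, \mu, \lambda, y_m \mid R_s, y_o, y_f) \propto f(y_{rT}, \mu, \lambda, y_m, R_s, y_o, y_f)$$

$$= f(\mu, \lambda) f(y_o \mid \mu, \lambda) f(y_f \mid \mu, \lambda) f(y_m \mid \mu, \lambda) \times f(R_o \mid y_o) f(R_f \mid y_f) f(R_m \mid y_m) \times f(y_{rT} \mid \mu, \lambda)$$

$$\propto \frac{1}{\lambda} \prod_{i \in o} \sqrt{\lambda} \, e^{-\frac{\lambda}{2}(y_i - \mu)^2} \prod_{i \in f} \sqrt{\lambda} \, e^{-\frac{\lambda}{2}(y_i - \mu)^2} \prod_{i \in m} \sqrt{\lambda} \, e^{-\frac{\lambda}{2}(y_i - \mu)^2}$$

$$\times \prod_{i \in o} p_i^1 (1 - p_i)^0 \prod_{i \in f} p_i^0 (1 - p_i)^1 \prod_{i \in m} p_i^0 (1 - p_i)^1$$

$$\times \frac{\sqrt{\lambda}}{\sqrt{N - n} \sqrt{2\pi}} \exp\left\{ -\frac{\lambda}{2(N - n)} \left( y_{rT} - (N - n)\mu \right)^2 \right\}.$$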
This joint density defines a suitable Metropolis-Hastings algorithm with
Gibbs steps that could be run to obtain a Monte Carlo sample from the
predictive distribution of the finite population mean $\bar{y}$.
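A Metropolis-Hastings algorithm with Gibbs steps for a model of this kind might be sketched in R as follows. This is not the WinBUGS implementation used for the reported results, and several ingredients are assumptions made here for illustration: a flat prior on (a, b), the random-walk step sizes, the starting values, the independence proposal $N(\mu, 1/\lambda)$ for each missing value, and the function names `loglik_ab` and `run_sampler`.

```r
# Sketch of a Metropolis-within-Gibbs sampler (assumptions: flat prior on (a, b),
# step sizes and starting values chosen arbitrarily; not the book's WinBUGS code)
yo <- c(12.57, 13.35, 11.47, 14.81, 13.25, 14.09, 11.55, 11.32, 13.2, 11.28,
        9.7, 12.18, 11.49, 10.52, 9.93, 11.84, 12.2, 10.57, 11.9, 14.75,
        10.34, 14.37, 12.13, 8.56, 11.91, 11.79, 11.45, 14.98, 10.57, 12.28,
        9.91, 10.94, 13.28, 11.43)
yf <- c(5.4, 9.41, 7.03, 8.88, 11.47, 7, 9.44, 8.58, 9.27, 8.18,
        8.62, 8.73, 7.33, 9.81, 9.88)
N <- 500; n <- 100; nm <- n - length(yo) - length(yf)   # nm = 51 still missing

# Log-likelihood of the response indicators given (a, b) and the sample values
loglik_ab <- function(a, b, y, R) {
  eta <- a + b * y
  sum(R * plogis(eta, log.p = TRUE) + (1 - R) * plogis(-eta, log.p = TRUE))
}

run_sampler <- function(J = 2000, burn = 500, step = c(1, 0.1), seed = 1) {
  set.seed(seed)
  R  <- c(rep(1, length(yo)), rep(0, length(yf) + nm))  # R_i for units in o, f, m
  ym <- rep(mean(yf), nm)                               # initial missing values
  mu <- mean(c(yo, yf)); lam <- 1 / var(c(yo, yf)); a <- 0; b <- 0
  out <- matrix(NA, J, 5, dimnames = list(NULL, c("mu", "lam", "a", "b", "ybar")))
  for (j in 1:(burn + J)) {
    ys <- c(yo, yf, ym)
    # Gibbs steps for mu and lam (prior proportional to 1/lam)
    mu  <- rnorm(1, mean(ys), 1 / sqrt(n * lam))
    lam <- rgamma(1, n / 2, sum((ys - mu)^2) / 2)
    # Metropolis step for each missing value: propose from N(mu, 1/lam) and
    # accept with probability min(1, (1 - p(new)) / (1 - p(old)))
    prop <- rnorm(nm, mu, 1 / sqrt(lam))
    logr <- plogis(-(a + b * prop), log.p = TRUE) - plogis(-(a + b * ym), log.p = TRUE)
    acc  <- log(runif(nm)) < logr
    ym[acc] <- prop[acc]
    # Random-walk Metropolis step for (a, b) under the assumed flat prior
    ys <- c(yo, yf, ym)
    ab <- c(a, b) + rnorm(2, 0, step)
    if (log(runif(1)) < loglik_ab(ab[1], ab[2], ys, R) - loglik_ab(a, b, ys, R)) {
      a <- ab[1]; b <- ab[2]
    }
    # Predictive draw of the nonsample total, then the finite population mean
    yrT <- rnorm(1, (N - n) * mu, sqrt((N - n) / lam))
    if (j > burn) out[j - burn, ] <- c(mu, lam, a, b, (sum(ys) + yrT) / N)
  }
  out
}

out <- run_sampler()
colMeans(out)
quantile(out[, "ybar"], c(0.025, 0.975))
```

Because the nonsample total does not feed back into any other update, it is drawn afresh from its conditional distribution at each iteration rather than carried as part of the chain's state. With a simple random-walk proposal, a and b can be expected to mix slowly, so a longer run or a tuned proposal would be advisable before relying on the output.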


One way to proceed is to implement this algorithm using WinBUGS and 
the code shown below (underneath the R Code below). Some of the 
results are as shown in Table 12.1. These inferences are based on 
J =10,000 iterations of a WinBUGS run, following an initial burn-in of 
size 1,000. 


Table 12.1 Results of WinBUGS analysis

node      mean     sd       MC error  2.5%      median   97.5%
a         -17.86   4.582    0.4184    -26.96    -17.79   -10.31
b         1.676    0.4535   0.04136   0.9301    1.672    2.586
lam       0.1921   0.04236  0.001112  0.118     0.189    0.2828
mu        9.688    0.3508   0.01358   8.976     9.693    10.35
ps[1]     0.9348   0.07378  0.006256  0.7572    0.959    0.997
ps[2]     0.9721   0.0535   0.004619  0.8664    0.9886   0.9996
ps[99]    0.1417   0.2097   0.003545  1.224E-5  0.04017  0.7787
ps[100]   0.1423   0.2101   0.003883  1.12E-5   0.03954  0.7731
ybar      9.687    0.3353   0.01329   9.013     9.696    10.32
yrT       3874.0   147.9    5.408     3573.0    3878.0   4156.0


From Table 12.1, we estimate the posterior mean of $\bar{y}$ by 9.69 and we
estimate the 95% CPDR for $\bar{y}$ as (9.01, 10.32). It will be noted that this
inference is significantly lower than the inferences in (a) and (b) where
the response mechanism was taken as ignorable. Some of the graphical
output from the WinBUGS run is shown in Figure 12.2.

Figure 12.2 Graphical output from WinBUGS
Discussion

It is instructive to now reveal that the data values in this exercise were in
fact generated as follows.

First, a finite population of size N = 500 was generated from the normal
distribution with mean $\mu = 10$ and standard deviation $\sigma = 2$. The mean
of the finite population values was calculated as $\bar{y} = 10.10$.

Note: We see that the CPDR in (c), (9.013, 10.32), contains this true
value of $\bar{y}$, whereas the CPDRs in (a) and (b), (11.42, 12.46) and
(10.33, 11.51), do not. This suggests the analysis in (c) was on the
right track.

Then a random sample of size n = 100 was taken from the finite
population according to SRSWOR. The sample mean was calculated as
$\bar{y}_s = 9.91$.

Note: Thus, if there had been no nonresponse then the finite population
mean (with true value 10.10) would have been estimated by 9.91.

Figure 12.3 shows histograms of the population and sample values, each
overlaid by the superpopulation density. The dots in the two subplots
show $\bar{y} = 10.10$ and $\bar{y}_s = 9.91$, respectively.
Figure 12.3 Histograms of the population and sample values

Then the probabilities of response were calculated as

$$p_i = \frac{1}{1 + \exp\{-(a + b y_i)\}}$$

with a = -15 and b = 1.4 (set in advance).

Using these probabilities, it was next determined which units would
respond, by sampling

$$R_i \sim \text{Bernoulli}(p_i)$$

for each i = 1,...,N.
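The logistic response mechanism just described can be sketched in R. The values a = -15 and b = 1.4 are those stated in the text. The helper name `resp_prob`, the grid of y values and the simulated population (which is not the book's population) are assumptions made here.

```r
# Sketch: logistic response probabilities and simulated response indicators
a <- -15; b <- 1.4                                   # values stated in the text
resp_prob <- function(y, a, b) plogis(a + b * y)     # 1 / (1 + exp(-(a + b*y)))

y_grid <- c(8, 10, -a / b, 12, 14)                   # -a/b is where p = 0.5
round(resp_prob(y_grid, a, b), 3)

set.seed(1)
y <- rnorm(500, 10, 2)                               # hypothetical population
R <- rbinom(length(y), 1, resp_prob(y, a, b))        # R_i ~ Bernoulli(p_i)
c(respondents = mean(y[R == 1]), nonrespondents = mean(y[R == 0]))
```

With b > 0 the response probability increases with y and crosses 0.5 at y = -a/b, roughly 10.7 here, so respondents tend to have larger values than nonrespondents.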


Thereby it was established which sample units would respond and which
would not. Figure 12.4 shows histograms of these two groups (of size
$n_o = 34$ and $n_u = 66$), each overlaid by the superpopulation density. The
dots in the left and right subplots show $\bar{y}_o$ and $\bar{y}_u$, respectively.

We see how the respondent values are systematically larger than the
nonrespondent values. This reflects the fact that units with larger values
were more likely to respond.

Figure 12.4 Observed and unobserved (non-responding) sample values
Figure 12.5 shows all N probabilities of response $p_1, \ldots, p_N$ plotted
against the population values $y_1, \ldots, y_N$. The crosses indicate population
units which would not respond if sampled, and these naturally tend to be
the units with the smallest values.

Figure 12.5 Probabilities of response in the population


Likewise, Figure 12.6 shows the n probabilities of response in the
sample plotted against the sample values. The crosses indicate sample
units which did not respond in actuality, and these tend to be the units
with the smallest values. The solid dots indicate the 15 units which were
selected for ‘forced’ follow-up according to SRSWOR (from the 66
non-responding sample units). Without these 15 ‘representative’ follow-up
values it would have been impossible to appropriately address the
nonignorable nonresponse problem and correct the biased inference in
(a) and (b) downward.

Figure 12.6 Probabilities of response in the sample
R Code for Exercise 12.2

# Preliminary: Data generation and description ===========

X11(w=8,h=4); par(mfrow=c(1,1)); options(digits=4);
N=500; n=100; mu=10; sig=2; a=-15; b=1.4;
set.seed(421); y=rnorm(N,mu,sig) # N finite population values
p=1/(1+exp(-(a+b*y))) # N probabilities of response (logistic)
plot(y,p) # OK

set.seed(123); R=rbinom(N,1,p) # N response indicators
set.seed(421); s=sort(sample(1:N,n)) # n sample labels
r = (1:N)[-s] # N-n nonsample labels
ys=y[s] # n sample values
yr=y[r] # N-n nonsample values
Rs = R[s] # n sample response indicators
Rr = R[r] # N-n nonsample response indicators

no = sum(Rs); nu = n-no; c(no,nu)
# 34 66 numbers of observed and unobserved units
o = s[Rs==1] # labels of observed sample values
u = s[Rs==0] # labels of unobserved sample values
rbind(s[1:10],Rs[1:10])
# [1,] 6 7 14 17 22 37 39 48 66 69
# [2,] 0 0  1  0  1  0  0  0  1  1
o[1:5] # 14 22 66 69 78 Correct
u[1:5] # 6 7 17 37 39 Correct

yo = y[o]; yu = y[u]
ybar=mean(y); ysbar=mean(ys); yrbar=mean(yr);
yobar=mean(yo); yubar=mean(yu)
c(ybar,ysbar,yrbar,yobar,yubar) # 10.095 9.907 10.143 11.938 8.860


4 Plot population and sample values ------------------------------- 

par(mfrowzc(1,2)) 

hist(y,probzT,xlabz"value", mainz" Population", 
xlimzc(3,17),ylimzc(0,0.25), breaks=seq(0,20,1)) 

lines(seq(0,20,0.1),dnorm(seq(0,20,0.1),mu,sig),Iwdz3) 

points(ybar,0,pch=16) 

hist(ys,probzT,xlabz"value", main="Sample", 
xlimzc(3,17),ylimzc(0,0.25), breaks=seq(0,20,1)) 

lines(seq(0,20,0.1),dnorm(seq(0,20,0.1),mu,sig),Iwdz3) 

points(ysbar,0,pch=16) 


# Plot observed and unobserved sample values ------------------------------- 

par(mfrow=c(1,2)) 

hist(yo,prob=T,xlab="value", main="Observed", 
xlim=c(3,17),ylim=c(0,0.35), breaks=seq(0,20,1)) 

lines(seq(0,20,0.1),dnorm(seq(0,20,0.1),mu,sig),lwd=3) 

points(yobar,0,pch=16) 

hist(yu,prob=T,xlab="value", main="Unobserved", 
xlim=c(3,17),ylim=c(0,0.35), breaks=seq(0,20,1)) 

lines(seq(0,20,0.1),dnorm(seq(0,20,0.1),mu,sig),lwd=3) 

points(yubar,0,pch=16) 


# Plot probabilities of response in population -------------- 
par(mfrow=c(1,1)) 

plot(y,p,xlab="y",ylab="p",main="") 

points(y[R==0], p[R==0],pch=4,cex=1.5) 

text(8,0.8,"The crosses represent nonrespondents") 


# Plot probabilities of response in sample and follow-up subsample -------------- 


par(mfrow=c(1,1)); plot(ys,p[s],xlab="y",ylab="p",main="") 
points(ys[Rs==0], p[s][Rs==0],pch=4,cex=1.5) 
nf=15; set.seed(112); followup = sort(sample(1:nu, nf)) # Follow up sample 
f=u[followup] # pop. labels of follow-up units 
yf=y[f] # The follow-up sample vector 
yfbar=mean(yf); yfbar # 8.601 mean of follow-up values 

points(yf, p[f], pch=16) # OK 

text(8,0.8,"The crosses represent nonrespondents") 

text(8,0.7,"The dots represent follow-up units") 


# Print data -------------------------------------------------- 

s # [1] 6 7 14 17 22 37 39 48 66 69 73 77 78 103 105 106 117........ 
o # [1] 14 22 66 69 78 141 152 156 172 228 230 232 ...... 
f # [1] 17 73 77 128 145 163 187 196 253 271 318 357 436 438 481 

paste(as.character(round(yo, 2)), collapse=", ") 
# 12.57, 13.35, 11.47, 14.81, 13.25, 14.09, 11.55, 11.32,13.2,11.28,9.7,12.18, 
# 11.49, 10.52, 9.93, 11.84, 12.2, 10.57, 11.9, 14.75, 10.34, 14.37, 12.13, 8.56, 
# 11.91, 11.79, 11.45, 14.98, 10.57, 12.28, 9.91, 10.94, 13.28, 11.43 
paste(as.character(round(yf,2)), collapse=", ") 

# 5.4, 9.41, 7.03, 8.88, 11.47, 7, 9.44, 8.58,9.27,8.18,8.62,8.73,7.33,9.81, 9.88 


yo = c(12.57, 13.35, 11.47, 14.81, 13.25, 14.09, 11.55, 11.32, 13.2, 11.28, 9.7, 
12.18, 11.49, 10.52, 9.93, 11.84, 12.2, 10.57, 11.9, 14.75, 10.34, 14.37, 12.13, 
8.56, 11.91, 11.79, 11.45, 14.98, 10.57, 12.28, 9.91, 10.94, 13.28, 11.43) 
no=length(yo); N=500; ybarhata = mean(yo); so=sd(yo) 
ybarcpdra=ybarhata+c(-1,1)*qt(0.975,no-1)*(so/sqrt(no))*sqrt(1-no/N) 
c(no,so,ybarhata, ybarcpdra) # 34.000 1.552 11.939 11.416 12.461 


yf = c(5.4,9.41,7.03,8.88,11.47,7,9.44,8.58,9.27,8.18,8.62,8.73, 7.33, 9.81,9.88) 
yof=c(yo,yf); nof=no+nf; ybarhatb = mean(yof);sof=sd(yof) 
ybarcpdrb=ybarhatb+c(-1,1)*qt(0.975,nof-1)*(sof/sqrt(nof))*sqrt(1-nof/N) 
c(nof,sof,ybarhatb, ybarcpdrb) # 49.000 2.168 10.917 10.326 11.509 


# Plot observed and follow-up sample values separately 

par(mfrow=c(1,2)) 

hist(yo, prob=T,xlab="value", main="Initially observed", 
xlim=c(3,17),ylim=c(0,0.35), breaks=seq(0,20,1)); 

points(mean(yo),0,pch=16); 

hist(yf, prob=T,xlab="value", main="Follow-up", 
xlim=c(3,17),ylim=c(0,0.35), breaks=seq(0,20,1)); 

points(mean(yf),0, pch=16) 
WinBUGS code for Exercise 12.2 


model 
{ 
for(i in 1:n){ 
zs[i] <- a + b*ys[i] 
logit(ps[i]) <- zs[i] 
rs[i] ~ dbern(ps[i]) 
ys[i] ~ dnorm(mu,lam) 
} 
a ~ dnorm(0.0,0.001) 
b ~ dnorm(0.0,0.001) 
mu ~ dnorm(0.0,0.001) 
lam ~ dgamma(0.001,0.001) 
ysT <- sum(ys[]) 
meanyrT <- nr*mu 
precyrT <- lam/nr 
yrT ~ dnorm(meanyrT,precyrT) 
ybar <- (ysT+yrT)/(n+nr) 
} 


# data 
list( n=100, nr=400, 
rs=c(  1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1, 
1,1,1,1,1,1,1,1,1,1, 1,1,1,1,0,0,0,0,0,0, 
0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0, 
0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0, 
0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0), 


ys=c( 

12.57, 13.35, 11.47, 14.81, 13.25, 14.09, 11.55, 11.32, 13.2, 11.28, 
9.7, 12.18, 11.49, 10.52, 9.93, 11.84, 12.2, 10.57, 11.9, 14.75, 
10.34, 14.37, 12.13, 8.56, 11.91, 11.79, 11.45, 14.98, 10.57, 12.28, 
9.91, 10.94, 13.28, 11.43, 5.4, 9.41, 7.03, 8.88, 11.47, 7, 

9.44, 8.58, 9.27, 8.18, 8.62, 8.73, 7.33, 9.81, 9.88, NA, 

NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 

NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 

NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 

NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 

NA, NA, NA, NA, NA, NA, NA, NA, NA, NA) ) 

# inits 


list(a=0,b=0,mu=0,lam=1) 
12.3 Selection bias in volunteer surveys 


Volunteer surveys are common nowadays, with the main media being 
the telephone and Internet. However, they can be misleading on account 
of selection bias, and this has been known for a long time. For example, 
in 1983 a major television network in the US conducted a phone-in (or 
dial-in) poll. Viewers were invited to phone the network and answer the 
following question: 

Should the United Nations continue to be based in the United States? 


Of the 185,000 phone calls subsequently registered, 33% were from 
persons answering yes, and 67% from persons answering no. The 
question then arose as to how reliable these figures are when applied to 
the American population as a whole. Many factors could affect said 
reliability, for example whether some people phoned in more than once. 


A key concern is that maybe yes-respondents were more, or less, likely 
to phone in than no-respondents. For example, if yes-respondents were 
less likely to phone in, then the sample almost certainly contained an 
unrepresentatively low proportion of yes-responses. Consequently, the 
figure 33% is biased and too low when taken as an estimate of the 
percentage of all Americans in favour of the UN being based in the US. 


Concerned about the accuracy of its phone-in polls generally, the TV 
network conducted an independent survey of the entire American 
population using proper probability sampling techniques. A SRSWOR 
of 1,000 persons yielded 72% yes-responses to the same question and 
28% no-responses. 


From these results, we may suspect that yes-respondents were indeed 
less likely to phone in than no-respondents. This prompts us to now 
study the issue in more depth, starting with the following model. This 
model and parts of the subsequent exposition can also be found in Puza 
and O'Neill (2006). 


12.4 A classical model for self-selection bias 


Suppose that there are a large number N of units in the population (e.g. 
persons in the US) and each unit has the same probability p of having a 
particular characteristic in question (e.g. being in favour of the UN being 
based in the US). 
Then define: 
   $y_i$ as the indicator for pop. unit i having the characteristic (0 or 1) 
   $\pi_i$ as the probability that unit i will be sampled (e.g. phone in to 
      answer the question) 
   $I_i$ as the indicator that population unit i is sampled. 

In this context the data is $D = (n, y_{sT})$, where: 
   $n = I_1 + \cdots + I_N$ is the observed sample size 
   $y_{sT} = y_1 + \cdots + y_n$ is the number of yes-respondents in the sample. 

Now, a ‘naive’ or ‘base’ model here is 

   $y_{sT} \sim \text{Bin}(n, p)$, 

and this leads to the straight sample proportion 

   $\bar{y}_s = y_{sT} / n$ 

as an estimate of p. 

We now wish to generalise this model to account for the possibility that 
$\bar{y}_s$ may be biased. To this end, suppose each $\pi_i$ can be one of two 
values: 

   $\phi_1$ if that unit has the characteristic in question, i.e. if $y_i = 1$ 
   $\phi_0$ if that unit does not have the characteristic, i.e. if $y_i = 0$. 

Note: We may then write $\pi_i = \phi_{y_i}$. 

Next, suppose that a unit with the characteristic in question is $\lambda$ times as 
likely to respond as a unit without the characteristic. Thus 

   $\phi_1 = \lambda \phi_0$. 

Also, write $\phi_0$ simply as $\phi$. Then, 

$$\pi_i = \begin{cases} \phi & \text{if } y_i = 0 \\ \lambda\phi & \text{if } y_i = 1 \end{cases} \;=\; \phi \lambda^{y_i}.$$ 
With the above definitions, we now consider the probability of a 
respondent having the characteristic (as distinct from the probability of a 
nonrespondent having the characteristic): 

$$
\begin{aligned}
P(y_i = 1 \mid I_i = 1) &= \frac{P(y_i = 1)\,P(I_i = 1 \mid y_i = 1)}{P(I_i = 1)} \quad \text{(note that we are applying Bayes' rule here)} \\
&= \frac{P(y_i = 1)\,P(I_i = 1 \mid y_i = 1)}{P(y_i = 0)\,P(I_i = 1 \mid y_i = 0) + P(y_i = 1)\,P(I_i = 1 \mid y_i = 1)} \\
&= \frac{p\phi_1}{(1-p)\phi_0 + p\phi_1} = \frac{p\phi\lambda}{(1-p)\phi + p\phi\lambda} = \frac{p\lambda}{1 - p + p\lambda}.
\end{aligned}
$$

Note: Observe how one of the parameters, namely $\phi$, cancels out here. 

We may now write $y_{sT} \sim \text{Bin}(n, \omega)$, where $\omega = \dfrac{p\lambda}{1 - p + p\lambda}$. 


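Next, the MLE and method of moments estimator of $\omega$ is $\bar{y}_s = y_{sT}/n$. 

Also, solving $\omega = \dfrac{p\lambda}{1 - p + p\lambda}$ for p yields $p = \dfrac{\omega}{\lambda - \lambda\omega + \omega}$. 

It follows that the MLE and MOME of p is $\hat{p} = \dfrac{\bar{y}_s}{\lambda - \lambda\bar{y}_s + \bar{y}_s}$. 

Also, $(L, U) = \left( \bar{y}_s \pm z_{\alpha/2} \sqrt{\dfrac{\bar{y}_s (1 - \bar{y}_s)}{n}} \right)$ is a $1 - \alpha$ CI for $\omega$. 

Therefore, a $1 - \alpha$ CI for p is $\left( \dfrac{L}{\lambda - \lambda L + L},\; \dfrac{U}{\lambda - \lambda U + U} \right)$. 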


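It is of interest to now discuss the biases of the two estimators mentioned 
above. First, the bias of $\bar{y}_s$ is 

$$B(\bar{y}_s) = \omega - p = p\left( \frac{\lambda}{1 - p + p\lambda} - 1 \right) = \frac{-p(1-p)(1-\lambda)}{1 - p(1 - \lambda)}.$$ 

This is not zero but reduces to zero when $\lambda = 1$, i.e. when $\pi_1 = \pi_0$. 

Also, the bias of $\hat{p}$ is $B(\hat{p}) = E\left( \dfrac{\bar{y}_s}{\lambda - \lambda\bar{y}_s + \bar{y}_s} \right) - p$. 

Just like $B(\bar{y}_s)$, $B(\hat{p})$ is nonzero but reduces to zero when $\lambda = 1$. But 
unlike $B(\bar{y}_s)$, $B(\hat{p})$ converges to zero as the sample size n tends to 
infinity, this being true for all $\lambda$. 

That is, $\hat{p} = \dfrac{\bar{y}_s}{\lambda - \lambda\bar{y}_s + \bar{y}_s}$ is asymptotically unbiased for p as $n \to \infty$. 

Note: This is obvious by construction. But just to check, we note that 
$E\bar{y}_s = \dfrac{p\lambda}{1 - p + p\lambda}$ and $V\bar{y}_s \to 0$. Therefore 

$$B(\hat{p}) \to \frac{\dfrac{p\lambda}{1 - p + p\lambda}}{\lambda - \lambda \dfrac{p\lambda}{1 - p + p\lambda} + \dfrac{p\lambda}{1 - p + p\lambda}} - p = 0 \quad \text{as } n \to \infty.$$ 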
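The contrast between the two biases can be checked by simulation. The following is a minimal sketch only: the values p = 0.4 and λ = 2, the sample sizes and the function name bias_sim are illustrative assumptions, not taken from the text. It draws the sample total from Bin(n, ω) and compares the average of each estimator with p.

```r
# Monte Carlo check of the biases of ybar_s and p-hat.
# p, lambda, the sample sizes and 'bias_sim' are illustrative assumptions.
set.seed(1)
p <- 0.4; lambda <- 2
omega <- p * lambda / (1 - p + p * lambda)   # P(yes | respondent)

bias_sim <- function(n, reps = 20000) {
  ybar <- rbinom(reps, n, omega) / n               # naive sample proportion
  phat <- ybar / (lambda - lambda * ybar + ybar)   # estimator adjusted for lambda
  c(n = n, bias_ybar = mean(ybar) - p, bias_phat = mean(phat) - p)
}

sims <- rbind(bias_sim(20), bias_sim(200), bias_sim(20000))
print(round(sims, 4))
```

Under these assumptions the bias of the naive proportion should stay near ω − p whatever the sample size, while the bias of the adjusted estimator should shrink towards zero as n grows.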


Example 12.1 Application to the US TV network scenario 
(a classical analysis) 


Observe that $\omega = \dfrac{p\lambda}{1 - p + p\lambda}$ implies $\lambda = \dfrac{\omega(1 - p)}{p(1 - \omega)}$. 


Then recall that the phone-in poll conducted by the TV network yielded 
an estimate of 0.33, and that the parallel scientifically designed (and 
‘proper’) survey yielded an estimate of 0.72. 


Thus we may estimate $\lambda = \pi_1 / \pi_0$ by 

$$\hat{\lambda} = \frac{\hat{\omega}(1 - \hat{p})}{\hat{p}(1 - \hat{\omega})} = \frac{0.33(1 - 0.72)}{0.72(1 - 0.33)} = 0.19.$$ 


This estimate being less than unity is consistent with our earlier intuition 
that the phone-in poll estimate might be too low due to yes-respondents 
being less likely to phone in than no-respondents. 
Example 12.2 Inference on p in a flag poll (a classical analysis) 


On 28 January 2000 an Internet poll was conducted by the Nine TV 
Network in Australia with the question: 
Should the Australian flag be replaced by a new one? 


To this poll there were 4,941 yes-responses and 4,512 no-responses, thus 
a proportion of 
4,941/(4,941 + 4,512) = 4,941/9,453 = 0.523 yeses. 


A similar question was asked in the Australian Constitutional 

Referendum Study, 1999 (Gow et al., 2000), and this proper survey 

yielded 829 yes-responses and 1,394 no-responses, thus a proportion of 
829/(829 + 1,394) = 829/2,223 = 0.373 yeses. 


Hence, for the 28 January Internet poll we may estimate $\lambda = \pi_1 / \pi_0$ by 

$$\hat{\lambda} = \frac{\hat{\omega}(1 - \hat{p})}{\hat{p}(1 - \hat{\omega})} = \frac{0.523(1 - 0.373)}{0.373(1 - 0.523)} = 1.84.$$ 


This suggests that persons who wanted the flag replaced were almost 
twice as likely to register their opinion via the Internet poll as persons 
who were happy with the old flag. 
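The estimates of λ in Examples 12.1 and 12.2 can be reproduced with a one-line helper. This is a small sketch; the function name lambda_hat is an assumption introduced here for illustration.

```r
# Estimate of lambda from the self-selected proportion (omega_hat) and the
# proportion from a proper probability survey (p_hat).
# 'lambda_hat' is a hypothetical helper name.
lambda_hat <- function(omega_hat, p_hat) {
  omega_hat * (1 - p_hat) / (p_hat * (1 - omega_hat))
}

lam_tv   <- lambda_hat(0.33, 0.72)                # Example 12.1 (US phone-in poll)
lam_flag <- lambda_hat(4941 / 9453, 829 / 2223)   # Example 12.2 (flag poll)
print(round(c(lam_tv, lam_flag), 2))
```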


Example 12.3 Inference on p in a currency poll 
(a classical analysis) 


On 4 June 2000 an Internet poll was conducted by the Nine TV Network 
with the question: 
Should the Queen's image be removed from our currency? 


To this there were 2,544 yes-responses and 1,755 no-responses, thus a 
proportion of 
2,544/(2,544 + 1,755) = 2,544/4,299 = 0.592 yeses. 


Now recall Example 12.2. Clearly there is some similarity between the 
two polls. Both were conducted on the Internet by the same organisation 
within the same half-year, and the two questions asked both relate to 
changing something about Australia's heritage. This similarity suggests 
that 1.84 may be a plausible value of $\lambda = \pi_1 / \pi_0$ to be used in the 4 June 
poll here. 

If so, we may estimate the true proportion of Australians in favour of 
removing the Queen's image from our currency as 

$$\hat{p} = \frac{\bar{y}_s}{\lambda - \lambda\bar{y}_s + \bar{y}_s} = \frac{0.592}{1.84 - 1.84 \times 0.592 + 0.592} = 0.441.$$ 


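Then, a 95% CI for $\omega = \dfrac{p\lambda}{1 - p + p\lambda}$ (the probability of a yes-response for 
a respondent) is 

$$(L, U) = \left( \bar{y}_s \pm z_{\alpha/2} \sqrt{\frac{\bar{y}_s(1 - \bar{y}_s)}{n}} \right) = \left( 0.592 \pm 1.96 \sqrt{\frac{0.592(1 - 0.592)}{4{,}299}} \right) = (0.577, 0.607).$$ 

Therefore, a $1 - \alpha$ CI for p (with $\alpha = 0.05$) is 

$$\left( \frac{L}{\lambda - \lambda L + L},\; \frac{U}{\lambda - \lambda U + U} \right) = \left( \frac{0.577}{1.84 - 1.84 \times 0.577 + 0.577},\; \frac{0.607}{1.84 - 1.84 \times 0.607 + 0.607} \right) = (0.426, 0.456).$$ 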
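The calculation in Example 12.3 can be wrapped in a small function so that it can be repeated for other assumed values of λ. This is a sketch only; the function name p_given_lambda and its argument names are assumptions made for illustration.

```r
# Classical point estimate and CI for p, treating lambda as known.
# 'p_given_lambda' and its arguments are hypothetical names.
p_given_lambda <- function(ysT, n, lambda, conf = 0.95) {
  ybar <- ysT / n                                     # MLE of omega
  z <- qnorm(1 - (1 - conf) / 2)
  ci_omega <- ybar + c(-1, 1) * z * sqrt(ybar * (1 - ybar) / n)
  to_p <- function(w) w / (lambda - lambda * w + w)   # maps omega to p
  list(omega_hat = ybar, ci_omega = ci_omega,
       p_hat = to_p(ybar), ci_p = to_p(ci_omega))
}

res <- p_given_lambda(ysT = 2544, n = 4299, lambda = 1.84)
print(lapply(res, round, 3))
```

Because the map from ω to p is increasing in ω, transforming the two endpoints of the interval for ω gives the interval for p directly.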


12.5 Uncertainty regarding the sampling 
mechanism 

In Example 12.3 above, the value of $\lambda$ was taken to be exactly 1.84. 
However, there is in fact uncertainty about $\lambda$ which ought to be taken 
into account and perhaps lead to a wider CI for p than the one reported. 

With this in mind we now postulate the following Bayesian model: 

   $(y_{sT} \mid p, \lambda) \sim \text{Bin}(n, \omega)$ where $\omega = \dfrac{p\lambda}{1 - p + p\lambda}$ (as before) 

   $(p \mid \lambda) \sim \text{Beta}(\alpha, \beta)$ 

   $\lambda \sim \text{Gamma}(\eta, \tau)$.   (12.2) 


Note: This model implicitly conditions on the sample size n. 
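As a rough cross-check that does not need WinBUGS, the joint posterior of (p, λ) under model (12.2) can be approximated on a grid. The sketch below is not the author's method: the function name grid_posterior, the grid limits and the grid resolution are all assumptions, and truncating λ to a finite grid is itself an approximation. The inputs are the flag-poll figures from Example 12.2 with a Beta(830, 1395) prior on p and a Gamma(0.000001, 0.000001) prior on λ.

```r
# Grid approximation to the posterior of (p, lambda) under model (12.2).
# 'grid_posterior', the grid limits and the resolution are assumptions.
grid_posterior <- function(ysT, n, alpha, beta, eta, tau, p_grid, lam_grid) {
  g <- expand.grid(p = p_grid, lam = lam_grid)
  omega <- g$p * g$lam / (1 - g$p + g$p * g$lam)
  logpost <- dbinom(ysT, n, omega, log = TRUE) +                 # likelihood
    dbeta(g$p, alpha, beta, log = TRUE) +                        # prior on p
    dgamma(g$lam, shape = eta, rate = tau, log = TRUE)           # prior on lambda
  w <- exp(logpost - max(logpost))   # subtract max for numerical stability
  g$w <- w / sum(w)                  # normalised grid weights
  g
}

post <- grid_posterior(ysT = 4941, n = 9453, alpha = 830, beta = 1395,
                       eta = 1e-6, tau = 1e-6,
                       p_grid = seq(0.30, 0.45, length.out = 301),
                       lam_grid = seq(1.2, 2.8, length.out = 401))
lam_mean <- sum(post$lam * post$w)   # approximate posterior mean of lambda
p_mean   <- sum(post$p * post$w)     # approximate posterior mean of p
print(round(c(lam_mean, p_mean), 3))
```

A grid is workable here only because there are two parameters; it is offered as a sanity check on MCMC output rather than a replacement for it.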
Example 12.4 Bayesian re-analysis of poll data in Example 12.2 

Recall the 28 January 2000 Internet poll yielding 4,941 yeses out of 
9,453 responses and the related properly conducted probability survey 
yielding 829 yeses and 1,394 nos. 

This suggests we apply the Bayesian model (12.2) in WinBUGS to 
estimate $\lambda$, with: 

   $\eta = \tau = 0.000001$ (implying an uninformative prior on $\lambda$) 

   $\alpha = 829 + 1 = 830$, $\beta = 1,394 + 1 = 1,395$ 
      (the posterior of p implied by the proper survey in a 
      binomial-beta model and then fed here as the prior for p) 

   $n = 9,453$, $y_{sT} = 4,941$ 
      (the observed data in the self-selected sample). 

Using suitable WinBUGS code (see below) and a sample size of 10,000 
after a burn-in of 1,000, we obtained results shown in Table 12.2. Figure 
12.7 shows some of the graphical output from WinBUGS. 


Table 12.2 Results of WinBUGS analysis 


node   mean    sd        MC error   2.5%     median   97.5% 
lam    1.843   0.08879   0.00271    1.677    1.841    2.026 
p      0.373   0.01022   3.15E-4    0.3529   0.373    0.393 


We see that $\lambda$ has been estimated as 1.84 again, but now with some 
measure of uncertainty: the 95% posterior interval estimate for $\lambda$ is 
(1.68, 2.03). 
Figure 12.7 Graphical output from WinBUGS 

[Figure: trace plot of lam over iterations 1 to 10,000, and density estimates of lam and p, each based on a sample of 10,000.] 


Equating the sample mean and sample variance of the 10,000 simulated 
values with the theoretical mean and variance of the $\text{Gamma}(\eta, \tau)$, 
namely $\eta/\tau$ and $\eta/\tau^2$, respectively, we may approximate the posterior 
distribution of $\lambda$ as $\text{Gamma}(\eta, \tau)$ with $\eta = 431$ and $\tau = 234$. 


Figure 12.8 shows a histogram of the simulated values overlaid by the 
gamma density defined by these parameters. We see that the gamma 
posterior approximation fits quite well. 


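Figure 12.8 Histogram of simulated values and fitted gamma 
density 

[Figure: histogram of the simulated values with the fitted gamma density overlaid; horizontal axis lambda, vertical axis Density.] 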
WinBUGS Code for Example 12.4 


model; 
{ 
ysT ~ dbin(omega,n) 
omega <- (p*lam)/(1-p+lam*p) 
lam ~ dgamma(eta,tau) 
p ~ dbeta(alpha,beta) 
} 


# data 
list(ysT=4941,n=9453,eta=0.000001, 
tau=0.000001,alpha=830,beta=1395) 


# inits 
list(p=0.5,lam=1) 


R Code for Example 12.4 
# Need to run BUGS code above first, using coda to create output in data.txt 


options(digits=3); 0.33*0.28/(0.72*0.67) # 0.192 
0.523*(1-0.373)/(0.373*(1-0.523)) # 1.84 
0.592/(1.84-1.84*0.592+0.592) # 0.441 

CIomega = 0.592+c(-1,1)*1.96*sqrt(0.592*(1-0.592)/4299) 
CIp = (CIomega/(1.84-1.84*CIomega+CIomega)) 
c(CIomega, CIp) # 0.577 0.607 0.426 0.456 


out=read.table(file=file.choose()) # choose data.txt from BUGS run 

lamvec = out[1:10000,2]; options(digits=5) 

lambar=mean(lamvec); lamvar=var(lamvec) 

taufit=lambar/lamvar; etafit=lambar*taufit 

c(lambar, lamvar, etafit, taufit) 
# 1.8432e+00 7.8849e-03 4.3087e+02 2.3376e+02 

summary(lamvec) 

# Min. 1st Qu. Median Mean 3rd Qu. Max. 

# 1.55 1.78 1.84 1.84 1.90 2.20 

X11(w=8,h=4); par(mfrow=c(1,1)) 

lamv <- seq(1.4,2.4,0.001) 

fv <- dgamma(lamv,431,234) 

hist(lamvec, prob=T,xlim=c(1.4,2.4), ylim=c(0,5),xlab="lambda",cex=1.5, 
breaks=seq(1,3,0.025), main="") 

lines(lamv,fv,lwd=3) 
Example 12.5 Bayesian re-analysis of poll data in Example 12.3 
using results in Example 12.4 


Recall the 4 June 2000 poll yielding 2,544 yeses out of 4,299 responses, 
leading to 0.441 as an estimate of p, with 95% CI (0.426, 0.456), based 
on $\lambda$ being exactly equal to 1.84. This suggests we apply our Bayesian 
model in WinBUGS to estimate p with: 

   $\eta = 431$, $\tau = 234$ 
      (using the posterior for $\lambda$ in Example 12.4 as the prior) 

   $\alpha = \beta = 1$ (implying an uninformative prior for p) 

   $n = 4,299$, $y_{sT} = 2,544$ 
      (the observed data in the self-selected sample). 

Using suitable WinBUGS code (see below), we obtained the results 
shown in Table 12.3. Some of the graphical output is shown in Figure 
12.9. 


Table 12.3 Results of WinBUGS analysis 


node mean sd MC error 2.5% median 97.5% 
lam 1.841 0.08801 0.001991 1.67 1.84 2.014 
p 0.4409 0.01408 3.18E-4 0.414 0.4406 0.4698 


We see that p has been estimated as 0.441 again, with 95% interval 
estimate (0.414, 0.470). It will be noted that this interval is wider than 
the one in Example 12.3; this may be attributed to the fact that in 
Example 12.3 uncertainty regarding $\lambda$ was not properly taken into 
account. For more information on the topic in this section, see Puza and 
O'Neill (2006). 


Note: The posterior for $\lambda$ is virtually the same as the prior for $\lambda$. This 
was to be expected, since (unlike in Example 12.4) the data here 
does not contain any structure which could tell us anything about the 
relationship between the sampling propensities $\pi_1$ and $\pi_0$. 
Figure 12.9 Graphical output from WinBUGS 

[Figure: density estimates of lam and p, each based on a sample of 10,000.] 


WinBUGS Code for Example 12.5 


model; 
{ 
ysT ~ dbin(omega,n) 
omega <- (p*lam)/(1-p+lam*p) 
lam ~ dgamma(eta,tau) 
p ~ dbeta(alpha,beta) 
} 


# data 
list(ysT=2544,n=4299,eta=431, 
tau=234,alpha=1,beta=1) 


# inits 
list(p=0.5,lam=1) 


12.6 Finite population inference under 
selection bias in volunteer surveys 


In the last section on selection bias in volunteer surveys, the finite 
population size N was introduced at the beginning, but then seemed to 
disappear from the notation. The Bayesian model subsequently 
developed did not feature N at all. 


This is a clue to the fact that the Bayesian model in that section is only 
useful for infinite population inference, in particular on the 
superpopulation parameter p, and cannot be used for inference on finite 
population quantities, in particular the finite population mean 


   $\bar{y} = (y_1 + \cdots + y_N)/N$. 

This is not an issue when N is very large (as it was assumed there), since 
in that case inference on $\bar{y}$ is, by the law of large numbers, virtually 
identical to inference on the superpopulation mean p. 


The following exercise develops a ‘true’ Bayesian finite population 
model in the same setting, one which could be useful in scenarios where 
N is not so large as to be effectively infinity. 


Exercise 12.3 A Bayesian finite population self-selection model 


Consider a finite population of N units, where each unit has common 
probability p of having some characteristic, independently of all the 
other units, and where our prior beliefs regarding p can be represented 
by way of a beta distribution with parameters $\alpha$ and $\beta$. 


A sample is selected from the finite population in such a way that every 
unit without the characteristic has probability $\phi$ of being sampled, and 
every unit with the characteristic has probability $\lambda\phi$ of being selected. 
Every unit that is sampled has its value fully observed. 


The prior on $\phi$ is beta with parameters $\delta$ and $\gamma$ but evenly spread over 
the interval (0, c), where c < 1 is a specified constant representing an 
absolute upper bound for what the value of $\phi$ could possibly be. 
(Examples of potentially suitable values of c are 0.1, 0.2 and 0.5.) 


Also, the prior on $\lambda$ is beta with parameters $\eta$ and $\tau$ but evenly spread 
over the interval (0, 1/c), so as to permit a suitably wide range of 
possible values for the ratio of sampling propensities $\pi_1 = \lambda\phi$ to $\pi_0 = \phi$. 
(For example, if c = 0.2 then that ratio could be anything from 0 to 5.) 


(a) Write down a Bayesian model which comprehensively represents the 
above situation. Assume that all of the model parameters are 
independent a priori. Clearly identify the data. 


(b) Suppose we are interested in both the superpopulation mean (i.e. the 
common probability of a unit having the characteristic, p) and the finite 
population mean (i.e. proportion of the N finite population units which 
have the characteristic, $\bar{y}$). Write down a formula for the joint posterior 
(and predictive) density of all quantities which are relevant to and could 
be used as a basis for the desired inference. 

(c) Use the density in (a) to construct a suitable Metropolis-Hastings 
algorithm. Then run the algorithm in R so as to redo the analyses in 
Examples 12.4 and 12.5. Perform each new analysis thrice, assuming the 
finite population size N is 200,000, 400,000 and 40,000, respectively. 


(d) Modify the MH algorithm in (c) so that its output features only the 
three model parameters and none of the nonsample values. (NB: The 
idea here is to design a superior MH algorithm, one with better 'mixing' 
than the one in (c).) 


(e) Describe a procedure whereby the output from the algorithm in (d) 
could be used to obtain a sample from the predictive distribution of the 
nonsample mean. Then run that algorithm and implement the procedure 
so as to produce results intended to be equivalent to those in the 
reanalysis of Example 12.5 in (c) with N = 200,000. 


Solution to Exercise 12.3 


(a) With $y = (y_1, \ldots, y_N)$ and $I = (I_1, \ldots, I_N)$, the Bayesian model may be 
written as follows: 

   $(I_i \mid y, p, \lambda, \phi) \sim \perp \text{Bernoulli}(\phi\lambda^{y_i}), \; i = 1, \ldots, N$ 

   $(y_1, \ldots, y_N \mid p, \lambda, \phi) \sim \text{iid Bernoulli}(p)$ 

   $(p \mid \lambda, \phi) \sim \text{Beta}(\alpha, \beta)$,   $(\lambda \mid \phi) \sim (1/c) \times \text{Beta}(\eta, \tau)$ 

   $\phi \sim c \times \text{Beta}(\delta, \gamma)$   $(0 < c < 1)$. 


Note: The sampling mechanism here is nonignorable and unknown, 
since $f(I \mid y, p, \lambda, \phi)$ depends on the unknown quantities $\phi$ and $\lambda$. If 
$\lambda$ were equal to 1 then the sampling mechanism would again be 
unknown but in that case ignorable, since $\phi \perp p$ a priori. 


The data here may be written as $D = (n, y_{sT})$, where: 

   $n = \sum_{i=1}^{N} I_i$ is the sample size 

   $y_{sT} = \sum_{i \in s} y_i$ is the number of sampled units with the characteristic. 
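Since the data is a function of $(I, y_s)$, the relevant joint posterior/ 
predictive density is 

$$
\begin{aligned}
f(\phi, \lambda, p, y_r \mid I, y_s) &\propto f(\phi, \lambda, p, y_r, I, y_s) \\
&= f(\phi, \lambda, p, y, I) \\
&= f(\phi) f(\lambda) f(p) \times f(y_s \mid p) f(y_r \mid p) \\
&\quad \times f(I_s \mid y_s, \phi, \lambda)\, f(I_r \mid y_r, \phi, \lambda) && (12.3) \\
&= \frac{(\phi/c)^{\delta-1}(1 - \phi/c)^{\gamma-1}}{c\,B(\delta, \gamma)} \times \frac{c\,(c\lambda)^{\eta-1}(1 - c\lambda)^{\tau-1}}{B(\eta, \tau)} \times \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)} \\
&\quad \times \prod_{i \in s} p^{y_i}(1-p)^{1-y_i} \times \prod_{i \in r} p^{y_i}(1-p)^{1-y_i} \\
&\quad \times \prod_{i \in s} (\phi\lambda^{y_i})^{I_i}(1 - \phi\lambda^{y_i})^{1-I_i} \times \prod_{i \in r} (\phi\lambda^{y_i})^{I_i}(1 - \phi\lambda^{y_i})^{1-I_i} && (12.4) \\
&\propto \phi^{\delta-1}(1 - \phi/c)^{\gamma-1} \times \lambda^{\eta-1}(1 - c\lambda)^{\tau-1} \times p^{\alpha-1}(1-p)^{\beta-1} \\
&\quad \times p^{y_{sT}}(1-p)^{n - y_{sT}} \times p^{y_{rT}}(1-p)^{N - n - y_{rT}} \\
&\quad \times \prod_{i \in s} (\phi\lambda^{y_i})^{1}(1 - \phi\lambda^{y_i})^{0} \times \prod_{i \in r} (\phi\lambda^{y_i})^{0}(1 - \phi\lambda^{y_i})^{1} && (12.5) \\
&= \phi^{\delta-1}(1 - \phi/c)^{\gamma-1} \times \lambda^{\eta-1}(1 - c\lambda)^{\tau-1} \times p^{\alpha-1}(1-p)^{\beta-1} \\
&\quad \times p^{y_{sT} + y_{rT}}(1-p)^{N - y_{sT} - y_{rT}} \times \phi^{n}\lambda^{y_{sT}}(1 - \phi\lambda)^{y_{rT}}(1 - \phi)^{N - n - y_{rT}}. && (12.6)
\end{aligned}
$$

Here $y_{rT} = \sum_{i \in r} y_i$ is the number of nonsampled units with the characteristic. 


Note 1: In all of the above e.g. (12.3), s and r are fixed at their 
observed values. 


Note 2: In the step from (12.4) to (12.5), be aware that $I_i = 1 \;\forall\, i \in s$ 
and $I_i = 0 \;\forall\, i \in r$. 

Note 3: In the step from (12.3) to (12.4), $f(\phi)$ is derived as follows. 

If $w = \phi/c \sim \text{Beta}(\delta, \gamma)$ then $f(w) = \dfrac{w^{\delta-1}(1-w)^{\gamma-1}}{B(\delta, \gamma)}$. 
Therefore 

$$f(\phi) = f(w)\left| \frac{dw}{d\phi} \right| = \frac{(\phi/c)^{\delta-1}(1 - \phi/c)^{\gamma-1}}{c\,B(\delta, \gamma)}.$$ 

A similar logic can be used to derive the density 

$$f(\lambda) = \frac{c\,(c\lambda)^{\eta-1}(1 - c\lambda)^{\tau-1}}{B(\eta, \tau)}.$$ 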
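The two scaled-beta densities in Note 3 can be checked numerically. In this sketch the function name dscaled_beta and the shape values 2 and 3 are arbitrary assumptions chosen only for the check; c = 0.2 is one of the values suggested in the exercise.

```r
# Density of scale * W where W ~ Beta(shape1, shape2), via change of variable.
# 'dscaled_beta' and the shape values below are illustrative assumptions.
dscaled_beta <- function(x, shape1, shape2, scale) {
  dbeta(x / scale, shape1, shape2) / scale
}

cval <- 0.2
# Prior of phi: c * Beta(delta, gamma), supported on (0, c)
area_phi <- integrate(dscaled_beta, 0, cval,
                      shape1 = 2, shape2 = 3, scale = cval)$value
# Prior of lambda: (1/c) * Beta(eta, tau), supported on (0, 1/c)
area_lam <- integrate(dscaled_beta, 0, 1 / cval,
                      shape1 = 2, shape2 = 3, scale = 1 / cval)$value

# Compare with the closed form in Note 3 at a single point
phi0 <- 0.05
closed_form <- (phi0 / cval)^(2 - 1) * (1 - phi0 / cval)^(3 - 1) / (cval * beta(2, 3))
print(c(area_phi, area_lam, dscaled_beta(phi0, 2, 3, cval), closed_form))
```

Both areas should be numerically indistinguishable from 1, and the last two printed values should agree.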


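(b) Examining the density in (a), in particular (12.6), we see that: 

$$f(y_r \mid D, \phi, \lambda, p) \propto [p(1 - \phi\lambda)]^{y_{rT}}\,[(1-p)(1-\phi)]^{N - n - y_{rT}}$$ 

$$\Rightarrow (y_{rT} \mid D, \phi, \lambda, p) \sim \text{Bin}(N - n, q), \quad \text{where } q = \frac{p(1 - \phi\lambda)}{p(1 - \phi\lambda) + (1-p)(1-\phi)} \qquad (12.7)$$ 

$$f(p \mid D, \phi, \lambda, y_{rT}) \propto p^{\alpha-1}(1-p)^{\beta-1}\, p^{y_{sT} + y_{rT}}(1-p)^{N - y_{sT} - y_{rT}}$$ 

$$\Rightarrow (p \mid D, \phi, \lambda, y_{rT}) \sim \text{Beta}(\alpha + y_{sT} + y_{rT},\; \beta + N - y_{sT} - y_{rT}). \qquad (12.8)$$ 

Also: 

$$f(\phi \mid D, y_{rT}, \lambda, p) \propto \phi^{\delta + n - 1}(1 - \phi/c)^{\gamma-1}(1 - \phi\lambda)^{y_{rT}}(1 - \phi)^{N - n - y_{rT}} \qquad (12.9)$$ 

$$f(\lambda \mid D, y_{rT}, \phi, p) \propto \lambda^{\eta + y_{sT} - 1}(1 - c\lambda)^{\tau-1}(1 - \phi\lambda)^{y_{rT}}. \qquad (12.10)$$ 


The above implies a suitable MH algorithm with two Gibbs steps as 
defined at (12.7) and (12.8) and two Metropolis steps as defined by 
(12.9) and (12.10). 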
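One way the algorithm just described might be coded is sketched below. The text specifies only which steps are Gibbs and which are Metropolis; the normal random-walk proposals, their standard deviations (phi_sd, lam_sd), the starting values, the function name mh_finite and all argument names are assumptions made for illustration. The bound c is called cmax in the code and γ is called gam.

```r
# Sketch of the sampler implied by (12.7)-(12.10): Gibbs for y_rT and p,
# random-walk Metropolis for phi and lambda. All names and tuning constants
# are assumptions, not the author's code.
mh_finite <- function(J, burn, N, n, ysT, cmax, alpha, beta, eta, tau,
                      delta, gam, phi_sd, lam_sd, phi0, lam0, p0) {
  log_f_phi <- function(phi, lam, yrT) {           # log of (12.9)
    if (phi <= 0 || phi >= cmax) return(-Inf)
    (delta + n - 1) * log(phi) + (gam - 1) * log(1 - phi / cmax) +
      yrT * log(1 - phi * lam) + (N - n - yrT) * log(1 - phi)
  }
  log_f_lam <- function(lam, phi, yrT) {           # log of (12.10)
    if (lam <= 0 || lam >= 1 / cmax) return(-Inf)
    (eta + ysT - 1) * log(lam) + (tau - 1) * log(1 - cmax * lam) +
      yrT * log(1 - phi * lam)
  }
  phi <- phi0; lam <- lam0; p <- p0
  out <- matrix(NA_real_, J, 4,
                dimnames = list(NULL, c("phi", "lam", "p", "ybar")))
  for (j in 1:(burn + J)) {
    q <- p * (1 - phi * lam) / (p * (1 - phi * lam) + (1 - p) * (1 - phi))
    yrT <- rbinom(1, N - n, q)                               # Gibbs step (12.7)
    p <- rbeta(1, alpha + ysT + yrT, beta + N - ysT - yrT)   # Gibbs step (12.8)
    phi_new <- rnorm(1, phi, phi_sd)                         # Metropolis step (12.9)
    if (log(runif(1)) < log_f_phi(phi_new, lam, yrT) - log_f_phi(phi, lam, yrT))
      phi <- phi_new
    lam_new <- rnorm(1, lam, lam_sd)                         # Metropolis step (12.10)
    if (log(runif(1)) < log_f_lam(lam_new, phi, yrT) - log_f_lam(lam, phi, yrT))
      lam <- lam_new
    if (j > burn) out[j - burn, ] <- c(phi, lam, p, (ysT + yrT) / N)
  }
  out
}

# Small illustrative run (toy sizes, not the settings used in the text)
set.seed(1)
demo_run <- mh_finite(J = 300, burn = 100, N = 2000, n = 100, ysT = 50,
                      cmax = 0.2, alpha = 1, beta = 1, eta = 1, tau = 1,
                      delta = 1, gam = 1, phi_sd = 0.01, lam_sd = 0.3,
                      phi0 = 0.05, lam0 = 1, p0 = 0.5)
print(round(colMeans(demo_run), 4))
```

Because φ < c and λ < 1/c, the product φλ stays below 1, so every logarithm in the two Metropolis targets is finite inside the support. The proposal standard deviations would need tuning for any real run.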
(c) The MH algorithm in (b) was applied with the following 
specifications so as to redo the analysis in Example 12.4: 

   $N = 200,000$, $n = 9,453$, $y_{sT} = 4,941$, $c = 0.2$ 
   $\alpha = 830$, $\beta = 1,395$, $\eta = \tau = 1$, $\delta = \gamma = 1$. 


A run with burn-in 2,000 followed by J = 10,000 iterations for inference 
was performed. Numerical results from this run are shown in Table 12.4. 


Table 12.4 Monte Carlo inferences using N = 200,000 


phi, $\phi$   lam, $\lambda$   p         ybar, $\bar{y}$ 
0.03597       1.84686          0.37259   0.37259   mean of simulated values 
0.08789       0.08789          0.01017   0.01022   sample standard deviation 
0.03449       1.68272          0.35266   0.35250   LB of 95% CPDR estimate 
0.03749       2.02311          0.39190   0.39202   UB of 95% CPDR estimate 


Our point and interval estimates for $\lambda$ are 1.85 and (1.68, 2.02), which 
are very similar to 1.84 and (1.68, 2.03) in Example 12.4. 


Note: The primary object here is estimation of $\lambda$, not of p or $\bar{y}$. But it 
will be noted that the estimates of these other two quantities (p or $\bar{y}$) 
are very alike, which is as one might expect. 


Repeating the above but with finite population sizes 400,000 and 40,000, 
respectively, we obtain the corresponding results shown in Table 12.5. 


Table 12.5 Inferences using different N (same details as in 
Table 12.4) 


N = 400,000 
phi, $\phi$   lam, $\lambda$   p         ybar, $\bar{y}$ 
0.01803       1.83548          0.37394   0.373948   mean of simulated values 
0.08546       0.08546          0.00981   0.009832   sample standard deviation 
0.01731       1.68407          0.35413   0.354113   LB of 95% CPDR estimate 
0.01878       2.00923          0.39122   0.391193   UB of 95% CPDR estimate 

N = 40,000 
phi, $\phi$   lam, $\lambda$   p          ybar, $\bar{y}$ 
0.18123       1.81588          0.375693   0.375834   mean of simulated values 
0.07579       0.07579          0.009203   0.009399   sample standard deviation 
0.17492       1.66922          0.357356   0.357050   LB of 95% CPDR estimate 
0.18813       1.97208          0.393969   0.394500   UB of 95% CPDR estimate 


Note: The three sets of inferences in Tables 12.4 and 12.5 have yielded 
different estimates of $\phi$ but very similar results for the other three 
quantities, in particular the object of this study, $\lambda$. 
Figure 12.10 shows graphical output from the first of the three 
Metropolis-Hastings algorithms (i.e. the one with N = 200,000). 


Figure 12.10 Graphical output from run with N = 200,000 
Next, a beta distribution was fitted to the 10,000 simulated values of $\lambda$ 
above (taken from the run with N = 200,000) so as to define the 
approximate posterior given by 

   $(\lambda \mid D) \sim (1/c) \times \text{Beta}(\eta', \tau')$, 

where $\eta' = 278.1$ and $\tau' = 474.8$ (with c = 0.2 as before). 


This posterior for $\lambda$ was then fed in as the prior for $\lambda$ so as to redo the 
analysis in Example 12.5. 


Accordingly, the MH algorithm in (b) was next applied once again but 
with the following specifications: 

   $N = 200,000$, $n = 4,299$, $y_{sT} = 2,544$, $c = 0.2$ 
   $\alpha = 1$, $\beta = 1$, $\eta = 278.1$, $\tau = 474.8$, $\delta = \gamma = 1$. 


The relevant numerical estimates are as shown in Table 12.6. 
Table 12.6: Inferences using N = 200,000 and a fitted beta prior 


phi, $\phi$   lam, $\lambda$   p         ybar, $\bar{y}$ 
0.01570       1.84272          0.44049   0.45248   mean of simulated values 
0.08792       0.08792          0.01408   0.01403   sample standard deviation 
0.01495       1.67553          0.41344   0.42555   LB of 95% CPDR estimate 
0.01656       2.01139          0.46602   0.47799   UB of 95% CPDR estimate 


Thus point and interval estimates for p are 0.440 and (0.413, 0.466), 
which we note are similar to 0.441 and (0.414, 0.470) in Example 12.5. 


Also point and 95% interval estimates for $\bar{y}$ are 0.452 and (0.426, 
0.478). 


Note 1: The inference on $\bar{y}$ here was not possible using the theory in 
the section just above the present exercise, i.e. using the infinite 
population models developed in that section. 


Note 2: The posterior for $\lambda$ is very similar to its prior, which is as one 
might expect, since the data now has no structure which could tell us 
anything further about that parameter. 


Repeating the above but with finite population sizes 400,000 and 40,000, 
respectively, we obtain the corresponding results shown in Table 12.7. 


Table 12.7 Inferences using different N (same details as in 
Table 12.6) 


N = 400,000 
phi, $\phi$   lam, $\lambda$   p         ybar, $\bar{y}$ 
0.007863      1.83516          0.44193   0.44792   mean of simulated values 
0.087755      0.08776          0.01375   0.01372   sample standard deviation 
0.007482      1.66809          0.41563   0.42160   LB of 95% CPDR estimate 
0.008299      2.00048          0.46819   0.47409   UB of 95% CPDR estimate 

N = 40,000 
phi, $\phi$   lam, $\lambda$   p         ybar, $\bar{y}$ 
0.07888       1.82895          0.44228   0.50220   mean of simulated values 
0.08162       0.08162          0.01359   0.01337   sample standard deviation 
0.07538       1.66402          0.41490   0.47517   LB of 95% CPDR estimate 
0.08278       1.99275          0.47007   0.52985   UB of 95% CPDR estimate 
Discussion 


Something to be noted above is that the estimate of $\bar{y}$ appears to increase 
slightly as N decreases, whereas the estimate of p remains about the same. 


The estimate of $\phi$ also increases as N decreases. This could present a 
‘problem’ if N is ‘too small’. Figures 12.11, 12.12 and 12.13 (pages 598 
and 599) show histograms of the simulated values when N = 200,000, 
20,000 and 15,000, respectively. 


We see no problem in the first two of these three cases. But for 
N = 15,000, the estimation of $\phi$ appears to be artificially restricted by 
our arbitrary choice of c as 0.2. (Observe that the simulated values are 
strongly ‘bunched up’ at just below 0.2.) 


Repeating the MCMC run with N = 15,000 but with c also changed to 
0.5 appears to solve this problem. Results are shown in Figure 12.14 
(page 599). We note that the estimate of $\lambda$ has changed from about 2 to 
less than 1. This suggests that we might get very similar results with c 
even larger, e.g. c = 1. 


But when we do this, we get very different results (not shown). Why? 


Because when we changed c from 0.2 to 0.5, we forgot to reconfigure 
the prior for $\lambda$, which also involves c. 


Note: The prior for $\phi$ also involves c but does not need reconfiguring 
(because that prior is uniform for all values of c, since $\delta = \gamma = 1$). 


Thus, Figure 12.14 (the case of N = 15,000 and c = 0.5) in fact illustrates 
output which is ‘flawed’ (in this sense) and so should be disregarded. 


Although these technical issues could satisfactorily be resolved with 
some effort, we will leave that task as an avenue of investigation for 
further research and move on to answering part (d). 
Figure 12.11 Histograms using N = 200,000 and c = 0.2 
Figure 12.12 Histograms using N = 20,000 and c = 0.2 
Figure 12.13 Histograms using N = 15,000 and c = 0.2 
Figure 12.14 Histograms using N = 15,000 and c = 0.5 
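(d) Recall the joint density (12.6). This density may also be written as: 

$$
\begin{aligned}
f(\phi, \lambda, p, y_r \mid I, y_s) \propto\; & f(\phi, \lambda, p)\, p^{y_{sT} + y_{rT}}(1-p)^{N - y_{sT} - y_{rT}} \\
& \times \phi^{n}\lambda^{y_{sT}}(1 - \phi\lambda)^{y_{rT}}(1 - \phi)^{N - n - y_{rT}},
\end{aligned}
$$

where $f(\phi, \lambda, p) \propto \phi^{\delta-1}(1 - \phi/c)^{\gamma-1} \times \lambda^{\eta-1}(1 - c\lambda)^{\tau-1} \times p^{\alpha-1}(1-p)^{\beta-1}$. 


Now observe that 

$$f(\phi, \lambda, p, y_r \mid I, y_s) \propto f(\phi, \lambda, p)\, \phi^{n}\lambda^{y_{sT}}\, p^{y_{sT}}(1-p)^{n - y_{sT}} \times k,$$ 

where: 

$$
\begin{aligned}
k &= [p(1 - \phi\lambda)]^{y_{rT}}\,[(1-p)(1-\phi)]^{N - n - y_{rT}} \\
&= [p(1 - \phi\lambda) + (1-p)(1-\phi)]^{N-n} \times \prod_{i \in r} z^{y_i}(1 - z)^{1 - y_i}
\end{aligned}
$$

$$z = \frac{p(1 - \phi\lambda)}{p(1 - \phi\lambda) + (1-p)(1-\phi)}.$$ 


Further observe that 

$$\sum_{y_r} \prod_{i \in r} z^{y_i}(1 - z)^{1 - y_i} = \prod_{i \in r} \sum_{y_i = 0}^{1} z^{y_i}(1 - z)^{1 - y_i} = 1$$ 

(since the first product is the joint pdf of N − n iid Bernoulli(z) 
variables). 


It follows that 

$$
\begin{aligned}
f(\phi, \lambda, p \mid I, y_s) &= \sum_{y_r} f(\phi, \lambda, p, y_r \mid I, y_s) \\
&\propto f(\phi, \lambda, p) \times \phi^{n}\lambda^{y_{sT}}\, p^{y_{sT}}(1-p)^{n - y_{sT}} \\
&\quad \times [p(1 - \phi\lambda) + (1-p)(1-\phi)]^{N-n}.
\end{aligned}
$$


The above defines a MH algorithm with three steps based on the 
following conditionals: 

$$f(\phi \mid D, \lambda, p) \propto \phi^{\delta + n - 1}(1 - \phi/c)^{\gamma-1}\,[p(1 - \phi\lambda) + (1-p)(1-\phi)]^{N-n}$$ 

$$f(\lambda \mid D, \phi, p) \propto \lambda^{\eta + y_{sT} - 1}(1 - c\lambda)^{\tau-1}\,[p(1 - \phi\lambda) + (1-p)(1-\phi)]^{N-n}$$ 

$$f(p \mid D, \phi, \lambda) \propto p^{\alpha + y_{sT} - 1}(1-p)^{\beta + n - y_{sT} - 1}\,[p(1 - \phi\lambda) + (1-p)(1-\phi)]^{N-n}.$$ 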
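A three-step Metropolis sampler for these conditionals might look as follows. This is a sketch under stated assumptions: normal random-walk proposals with standard deviations sds, the function name mh_marginal, and the starting vector init are all illustrative choices not given in the text. Since each conditional is proportional to the joint density of (φ, λ, p) with the nonsample values summed out, a single log-density function serves all three steps.

```r
# Sketch of the sampler in (d): the nonsample values are integrated out and
# each of phi, lambda and p is updated by a random-walk Metropolis step.
# Function name, proposals and tuning values are assumptions.
mh_marginal <- function(J, burn, N, n, ysT, cmax, alpha, beta, eta, tau,
                        delta, gam, sds, init) {
  log_post <- function(th) {       # log f(phi, lambda, p | I, y_s) up to a constant
    phi <- th[1]; lam <- th[2]; p <- th[3]
    if (phi <= 0 || phi >= cmax || lam <= 0 || lam >= 1 / cmax ||
        p <= 0 || p >= 1) return(-Inf)
    (delta + n - 1) * log(phi) + (gam - 1) * log(1 - phi / cmax) +
      (eta + ysT - 1) * log(lam) + (tau - 1) * log(1 - cmax * lam) +
      (alpha + ysT - 1) * log(p) + (beta + n - ysT - 1) * log(1 - p) +
      (N - n) * log(p * (1 - phi * lam) + (1 - p) * (1 - phi))
  }
  th <- init
  lp <- log_post(th)
  out <- matrix(NA_real_, J, 3, dimnames = list(NULL, c("phi", "lam", "p")))
  for (j in 1:(burn + J)) {
    for (k in 1:3) {                           # update one parameter at a time
      prop <- th
      prop[k] <- rnorm(1, th[k], sds[k])
      lp_prop <- log_post(prop)
      if (log(runif(1)) < lp_prop - lp) {      # Metropolis accept/reject
        th <- prop
        lp <- lp_prop
      }
    }
    if (j > burn) out[j - burn, ] <- th
  }
  out
}

# Small illustrative run (toy sizes, not the settings used in the text)
set.seed(2)
demo_marg <- mh_marginal(J = 300, burn = 100, N = 2000, n = 100, ysT = 50,
                         cmax = 0.2, alpha = 1, beta = 1, eta = 1, tau = 1,
                         delta = 1, gam = 1, sds = c(0.01, 0.3, 0.05),
                         init = c(0.05, 1, 0.5))
print(round(colMeans(demo_marg), 4))
```

The starting vector must lie inside the support so that the initial log density is finite.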
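(e) From the working in (d) we see that 

$$(y_{rT} \mid I, y_s, \phi, \lambda, p) \sim \text{Bin}(N - n, z), \quad \text{where } z = \frac{p(1 - \phi\lambda)}{p(1 - \phi\lambda) + (1-p)(1-\phi)}. \qquad (12.11)$$ 


So, to get a sample from the predictive distribution of $\bar{y}$ we do as 
follows: 


1. Obtain $(\phi_j, \lambda_j, p_j) \sim \text{iid } f(\phi, \lambda, p \mid I, y_s)$, $j = 1, \ldots, J$ 
using the MH algorithm in (d) 


2. Sample $y_{rT}^{(j)} \sim \text{Bin}(N - n, z_j)$, where 
$z_j = \dfrac{p_j(1 - \phi_j\lambda_j)}{p_j(1 - \phi_j\lambda_j) + (1 - p_j)(1 - \phi_j)}$, 
$j = 1, \ldots, J$ (from (12.11)) 


3. Calculate $\bar{y}^{(j)} = \dfrac{1}{N}\left( y_{sT} + y_{rT}^{(j)} \right)$, $j = 1, \ldots, J$. 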


We now perform the MH algorithm in (d) and the above procedure with: 
$N = 200{,}000$, $n = 4299$, $y_{sT} = 2544$, $c = 0.2$, 
$\alpha = 1$, $\beta = 1$, $\eta = 278.1$, $\tau = 474.8$, $\delta = \gamma = 1$. 


We thereby obtain the inferences shown in Table 12.8. 
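
The prior values 278.1 and 474.8 used above are obtained in the R code by numerically matching the prior mean and standard deviation of lambda to an earlier posterior estimate (1.846864) and standard deviation (0.087889). As a cross-check, the same matching can be done in closed form. This is a sketch under the assumption that c times lambda has a Beta(eta, tau) distribution, which is what the prior kernel used in (d) implies; the function name `match_scaled_beta` is hypothetical.

```r
# Closed-form moment matching for a beta prior rescaled to (0, 1/c).
# Assumes c * lambda ~ Beta(eta, tau); 'est' and 'se' are the target mean and sd.
match_scaled_beta <- function(est, se, c = 0.2) {
  m <- c * est                 # target mean on the (0, 1) scale
  v <- (c * se)^2              # target variance on the (0, 1) scale
  s <- m * (1 - m) / v - 1     # implied value of eta + tau
  c(eta = m * s, tau = (1 - m) * s)
}

et <- match_scaled_beta(est = 1.846864, se = 0.087889)
et

# Check: prior mean and sd of lambda implied by the matched values
prior_mean <- (1 / 0.2) * et[["eta"]] / sum(et)
prior_sd <- sqrt((1 / 0.2^2) * prod(et) / (sum(et)^2 * (sum(et) + 1)))
c(prior_mean, prior_sd)
```

The result should agree with the values found by `optim()` in the R code to about two decimal places, and the implied mean and sd should reproduce the two targets.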


Table 12.8 Results obtained in part (e) 


                              phi, φ     lam, λ    p          ybar, ȳ 

mean of simulated values      0.01567    1.8491    0.43973    0.43973 
sample standard deviation     0.08660    0.0866    0.01387    0.01382 
LB of 95% CPDR estimate       0.01491    1.6844    0.41331    0.41346 
UB of 95% CPDR estimate       0.01650    2.0278    0.46689    0.46673 


We see that inferences are very similar to those in the reanalysis of Example 12.5 in (c) with N = 200,000 (where $\bar{y}$ was estimated as 0.45248). But the results here should in fact be considered more accurate because they are based on a MH algorithm with fewer components. 




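Note 1: The inferences on $\bar{y}$ could be further improved via Rao-Blackwell arguments which obviate the need to sample values of $y_{rT}$ at all. In particular, the Rao-Blackwell estimate of the predictive mean of the finite population mean, $\hat{\bar{y}} = E(\bar{y} \mid D)$, is 

$$\bar{\pi} = \frac{1}{J}\sum_{j=1}^{J} \pi_j = 0.4364,$$

where $\pi_j$ is the binomial probability (12.11) evaluated at the $j$th simulated values, with 95% CI for $\hat{\bar{y}}$ 

$$\left(\bar{\pi} \pm 1.96\sqrt{\frac{1}{J(J-1)}\sum_{j=1}^{J}(\pi_j - \bar{\pi})^2}\,\right) = (0.4361, 0.4367).$$

Actually, this is not quite right, since $\bar{\pi}$ is the Rao-Blackwell estimate of $\hat{\bar{y}}_r = E(\bar{y}_r \mid D)$, where $\bar{y}_r = y_{rT}/(N-n)$ is the mean of the nonsampled values, and the 95% CI is for $\hat{\bar{y}}_r$. To see this, refer to step 2 above. 
Thus, since 

$$\bar{y} = \frac{1}{N}\left(y_{sT} + (N-n)\bar{y}_r\right),$$

the RB estimate of $\hat{\bar{y}}$ is actually 

$$\frac{1}{N}\left(y_{sT} + (N-n)\bar{\pi}\right) = 0.440,$$

with a 95% confidence interval for $\hat{\bar{y}}$ equal to 

$$\left(\frac{1}{N}\left(y_{sT} + (N-n)0.4361\right),\ \frac{1}{N}\left(y_{sT} + (N-n)0.4367\right)\right) = (0.439, 0.440).$$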


Note 2: The Monte Carlo 95% confidence intervals reported here are unduly narrow (i.e. will have less than 95% actual coverage). This is because we did not address the problem of the very strong serial correlation amongst the values outputted from the Metropolis-Hastings algorithm, for example by way of thinning or the batch means method. 


But this remark only applies to confidence intervals for mean estimates and not to posterior or predictive interval estimates, such as (0.413, 0.467) for $\bar{y}$ in Table 12.8. 
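
As a rough illustration of the batch means method mentioned in Note 2, the sketch below computes a standard error for the mean of a serially correlated series and compares it with the naive standard error. It is not part of the original solution: the function name `batch_means_se`, the choice of 20 batches and the simulated AR(1) series are all assumptions made for the example.

```r
# Batch means standard error for the mean of a serially correlated chain.
# x: vector of MCMC output; nb: number of batches (an illustrative choice).
batch_means_se <- function(x, nb = 20) {
  m <- floor(length(x) / nb)            # batch length
  x <- x[seq_len(m * nb)]               # drop any leftover values
  bm <- colMeans(matrix(x, nrow = m))   # each column is one contiguous batch
  sd(bm) / sqrt(nb)
}

# Simulated AR(1) series standing in for autocorrelated MH output
set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.9), n = 10000))

naive_se <- sd(x) / sqrt(length(x))     # ignores the serial correlation
bm_se <- batch_means_se(x)              # allows for it
c(naive = naive_se, batch = bm_se)

# 95% CI for the mean based on the batch means
mean(x) + c(-1, 1) * qt(0.975, df = 20 - 1) * bm_se
```

With strong positive autocorrelation the batch means standard error is noticeably larger than the naive one, which is why the naive intervals are too narrow.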
R Code for Exercise 12.3 


# MH: draws yrT and p from their conditionals and updates phi and lam by Metropolis steps 
MH = function(J=100, n=9453, ysT=4941, alp=830, bet=1395, 
p=0.5, phi0=0.1, lam0=1, phisd=0.1, lamsd=0.1, 
eta=1, tau=1, del=1, gam=1, c=0.2, N=200000 ){ 
phi=phi0; lam=lam0; phiv=phi; lamv=lam; phict=0; lamct=0; pv=NA; yrTv=NA 
for(j in 1:J){ 
q=p*(1-phi*lam)/( p*(1-phi*lam) + (1-p)*(1-phi) ) 
yrT=rbinom(1,N-n,q); yT=ysT+yrT; p=rbeta(1,alp+yT, bet+N-yT) 
phinew=rnorm(1,phi,phisd) 
if((phinew>0)&&(phinew<c)){ 
logprobnum=(del-1)*log(phinew)+(gam-1)*log(1- phinew/c)+ 
n*log(phinew) +yrT*log(1- phinew*lam)+(N-n-yrT)*log(1-phinew) 
logprobden=(del-1)*log(phi)+(gam-1)*log(1-phi/c)+ 
n*log(phi) +yrT*log(1-phi*lam)+(N-n-yrT)*log(1-phi) 
logprob= logprobnum- logprobden; prob=exp(logprob) 
u=runif(1); if(u<=prob){ phict=phict+1; phi=phinew } } 
lamnew=rnorm(1,lam,lamsd) 
if((lamnew>0)&&(lamnew<(1/c))){ 
logprobnum= (eta-1)*log(lamnew)+(tau-1)*log(1- lamnew*c)+ 
ysT *log(lamnew)+yrT*log(1-phi*lamnew) 
logprobden= (eta-1)*log(lam)+(tau-1)*log(1-lam*c)+ 
ysT*log(lam)+yrT*log(1-phi*lam) 
logprob= logprobnum- logprobden; prob=exp(logprob) 
u=runif(1); if(u<=prob){ lamct=lamct+1; lam=lamnew } } 
phiv=c(phiv,phi); lamv=c(lamv,lam); pv=c(pv,p); yrTv=c(yrTv,yrT) } 
phiar=phict/J; lamar=lamct/J 
list(pv=pv, yrTv=yrTv, phiv=phiv, lamv=lamv, phiar=phiar, lamar=lamar) } 
# end fn 
X11(w=8,h=6); par(mfrow=c(2,2)); options(digits=5); N=200000 


set.seed(531); res=MH(J=2000, n=9453, ysT=4941, alp=830, bet=1395, 
p=0.5, phi0=0.1, lam0=1, phisd=0.0007, lamsd=0.04, 
eta=1, tau=1, del=1,gam=1, c=0.2, N=N) 
c(res$phiar,res$lamar) # 0.513 0.536 OK 
plot(res$pv); plot(res$yrTv); plot(res$phiv); plot(res$lamv) # Has burnt in OK 
p0=res$pv[2001]; lam0=res$lamv[2001]; phi0=res$phiv[2001] 
# record last values 
set.seed(131); K=10000; date() # 
res=MH(J=K, n=9453, ysT=4941, alp=830, bet=1395, 
p=p0, phi0=phi0, lam0=lam0, phisd=0.0006, lamsd=0.04, 
eta=1, tau=1, del=1, gam=1, c=0.2, N=N ); date() # 
c(res$phiar,res$lamar) # 0.5548 0.5707 OK 
plot(res$pv); plot(res$yrTv); plot(res$phiv); plot(res$lamv) # OK 


# Example of optional thinning to reduce serial correlation: 

# acf(res$pv[-1]); acf(res$yrTv[-1]); acf(res$phiv[-1]); acf(res$lamv[-1]) 

# skip=10; inc=1+seq(skip,K,skip); J=length(inc); J # 1000 

# pv=res$pv[inc]; yrTv=res$yrTv[inc]; phiv=res$phiv[inc]; lamv=res$lamv[inc] 
# acf(pv); acf(yrTv); acf(phiv); acf(lamv) # better 


skip=1; inc=1+seq(skip,K,skip); J=length(inc); J # 10000 (Just use whole sample) 

pv=res$pv[inc]; yrTv=res$yrTv[inc]; phiv=res$phiv[inc]; lamv=res$lamv[inc] 
hist(pv,prob=T); hist(yrTv,prob=T); hist(phiv,prob=T); hist(lamv,prob=T); # OK 


# Calculate estimates (Note we could improve these via Rao-Blackwell): 
phat=mean(pv); pcpdr=quantile(pv,c(0.025,0.975)); pse=sd(pv) 
lamhat=mean(lamv); lamcpdr=quantile(lamv,c(0.025,0.975)); lamse=sd(lamv) 
phihat=mean(phiv); phicpdr=quantile(phiv,c(0.025,0.975)); phise=sd(phiv) 
# (the phi "se" values recorded below equal the lam ones, i.e. they came from sd(lamv)) 
n= 9453; ysT=4941; ybarv=(1/N)*(ysT+yrTv); 
ybarhat=mean(ybarv); ybarcpdr=quantile(ybarv,c(0.025,0.975)); 
ybarse=sd(ybarv) 
print(cbind(c(phihat, phise ,phicpdr), c(lamhat, lamse ,lamcpdr), 

c(phat, pse,pcpdr), c(ybarhat,ybarse, ybarcpdr)), digits=4) 


# B ---------------------------------- 
#           phi     lam     p       ybar 
#        0.03597 1.84686 0.37259 0.37259 mean 
#        0.08789 0.08789 0.01017 0.01022 se 
# 2.5%   0.03449 1.68272 0.35266 0.35250 LB 
# 97.5%  0.03749 2.02311 0.39190 0.39202 UB 


# Repeat above exactly from A to B but after setting N=400000. Results: 


# 0.01803 1.83548 0.37394 0.373948 
# 0.08546 0.08546 0.00981 0.009832 
#2.5% 0.01731 1.68407 0.35413 0.354113 
# 97.5% 0.01878 2.00923 0.39122 0.391193 
# Repeat above exactly from A to B but after setting N=40000. Results: 


# 0.18123 1.81588 0.375693 0.375834 
# 0.07579 0.07579 0.009203 0.009399 
# 2.5% 0.17492 1.66922 0.357356 0.357050 
# 97.5% 0.18813 1.97208 0.393969 0.394500 


# Now calculate new prior from posterior of lambda (based on 1st run above): 
c(lamhat,lamse) # 1.846864 0.087889 
fun=function(etatau, c=0.2, est=lamhat, se=lamse){ 
(est-(1/c)*etatau[1]/sum(etatau))^2+ 
( se^2 - (1/c^2)*prod(etatau)/( sum(etatau)^2*(1+sum(etatau))) )^2 } 
etataunew0 = optim(par=c(2,5), fn=fun)$par 
etataunew = optim(par= etataunew0, fn=fun)$par 


etanew=etataunew[1]; taunew=etataunew[2] 

c(etanew, taunew) # 278.10 474.79 

(1/0.2)*etanew/(etanew+taunew) # 1.8469 

sqrt((1/0.2^2)*etanew*taunew/((etanew+taunew)^2*(etanew+taunew+1))) 
# 0.087889 OK 


# Now run MCMC with new prior and data: ------------------------------ 
par(mfrow=c(2,2)); N=200000 


set.seed(531); res=MH(J=2000, n=4299, ysT=2544, alp=1, bet=1, 
p=0.5, phi0=0.1, lam0=1, phisd=0.0007, lamsd=0.04, 
eta=etanew, tau=taunew, del=1, gam=1, c=0.2, N=N) 
c(res$phiar,res$lamar) # 0.4295 0.5485 OK 
plot(res$pv); plot(res$yrTv); plot(res$phiv); plot(res$lamv) # Has burnt in OK 
p0=res$pv[2001]; lam0=res$lamv[2001]; phi0=res$phiv[2001] 
# record last values 


set.seed(131); K=10000; date() # 
res=MH(J=K, n=4299, ysT=2544, alp=1, bet=1, 
p=p0, phi0=phi0, lam0=lam0, phisd=0.0004, lamsd=0.05, 
eta=etanew, tau=taunew, del=1, gam=1, c=0.2, N=N ); date() # 
c(res$phiar,res$lamar) # 0.5473 0.5908 OK 
plot(res$pv); plot(res$yrTv); plot(res$phiv); plot(res$lamv) # OK 


skip=1; inc=1+seq(skip,K,skip); J=length(inc); J # 10000 (Just use whole sample) 

pv=res$pv[inc]; yrTv=res$yrTv[inc]; phiv=res$phiv[inc]; lamv=res$lamv[inc] 
hist(pv,prob=T); hist(yrTv,prob=T); hist(phiv,prob=T); hist(lamv,prob=T); # OK 
# Calculate estimates (Note we could improve these via Rao-Blackwell): 


phat=mean(pv); pcpdr=quantile(pv,c(0.025,0.975)); pse=sd(pv) 
lamhat=mean(lamv); lamcpdr=quantile(lamv,c(0.025,0.975)); lamse=sd(lamv) 
phihat=mean(phiv); phicpdr=quantile(phiv,c(0.025,0.975)); phise=sd(phiv) 
# (the phi "se" values recorded below equal the lam ones, i.e. they came from sd(lamv)) 
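n=4299; ysT=2544; ybarv=(1/N)*(ysT+yrTv); # sample size and sample total for this run 
# (the ybar values recorded below exceed those of p by (4941-2544)/N, i.e. they came from ysT=4941) 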
ybarhat=mean(ybarv); ybarcpdr=quantile(ybarv,c(0.025,0.975)); 
ybarse=sd(ybarv) 
print(cbind(c(phihat, phise ,phicpdr), c(lamhat, lamse ,lamcpdr), 

c(phat, pse,pcpdr), c(ybarhat,ybarse, ybarcpdr)), digits=4) 


# D ------------------------------------------------- 

# phi lam p ybar 

# 0.01570 1.84272 0.44049 0.45248 mean 
# 0.08792 0.08792 0.01408 0.01403 se 
#2.5% 0.01495 1.67553 0.41344 0.42555 LB 

# 97.5% 0.01656 2.01139 0.46602 0.47799 UB 


# Repeat above exactly from C to D but with N=400000. Results: 


# 0.007863 1.83516 0.44193 0.44792 
# 0.087755 0.08776 0.01375 0.01372 
#2.5% 0.007482 1.66809 0.41563 0.42160 
# 97.5% 0.008299 2.00048 0.46819 0.47409 


# Repeat above exactly from C to D but with N=40000. Results: 


# 0.07888 1.82895 0.44228 0.50220 
# 0.08162 0.08162 0.01359 0.01337 
# 2.5% 0.07538 1.66402 0.41490 0.47517 
# 97.5% 0.08278 1.99275 0.47007 0.52985 


# Repeat above exactly from C to D but with N=20000 and 15000 to produce 
# extra graphs. We omit the code for the case N = 15000, c=0.5 and the case 
#N=15000,c=1 


# (e) 

# MH2: Metropolis-Hastings sampler for (p, phi, lam) with yrT summed out, as in part (d) 
MH2 = function(J=100, n=9453, ysT=4941, alp=830, bet=1395, 
p0=0.5, phi0=0.1, lam0=1, psd=0.1, phisd=0.1, lamsd=0.1, 
eta=1,tau=1, del=1,gam=1, c=0.2, N=200000 ){ 

p=p0; phi=phi0; lam=lam0; pv=p; phiv=phi; lamv=lam; pct=0; phict=0; 

lamct=0; 
for(j in 1:J){ 


pnew=rnorm(1,p,psd) 
if((pnew>0)&&(pnew<1)){ 
logprobnum=(alp-1+ysT)*log(pnew)+(bet-1+n-ysT)*log(1-pnew) + 
(N-n)*log((1-pnew)*(1-phi)+pnew*(1-phi*lam)) 
logprobden=(alp-1+ysT)*log(p)+(bet-1+n-ysT)*log(1-p) + 
(N-n)*log((1-p)*(1-phi)+p*(1-phi*lam)) 
logprob= logprobnum- logprobden; prob=exp(logprob) 
u=runif(1); if(u<=prob){ pct=pct+1; p=pnew} } 
phinew=rnorm(1, phi, phisd) 
if((phinew>0)&&(phinew<c)){ 
logprobnum=(del-1+n)*log(phinew)+(gam-1)*log(1- phinew/c)+ 
(N-n)*log((1-p)*(1-phinew)+p*(1-phinew*lam)) 
logprobden=(del-1+n)*log(phi)+(gam-1)*log(1-phi/c)+ 
(N-n)*log((1-p)*(1-phi)+p*(1-phi*lam)) 
logprob= logprobnum- logprobden; prob=exp(logprob) 
u=runif(1); if(u<=prob){ phict=phict+1; phi=phinew } } 
lamnew=rnorm(1,lam,lamsd) 
if((lamnew>0)&&(lamnew<(1/c))){ 
logprobnum= (eta-1+ysT)*log(lamnew)+(tau-1)*log(1- lamnew*c)+ 
(N-n)*log((1-p)*(1-phi)+p*(1-phi*lamnew)) 
logprobden= (eta-1+ysT)*log(lam)+(tau-1)*log(1- lam*c)+ 
(N-n)*log((1-p)*(1-phi)+p*(1-phi*lam)) 
logprob= logprobnum- logprobden; prob=exp(logprob) 
u=runif(1); if(u<=prob){ lamct=lamct+1; lam=lamnew } } 
pv=c(pv,p); phiv=c(phiv,phi); lamv=c(lamv,lam) } 


par=pct/J; phiar=phict/J; lamar=lamct/J 
list(pv=pv, phiv=phiv, lamv=lamv, par=par, phiar=phiar, lamar=lamar) } 
# end fn 


X11(w=8,h=6); par(mfrow=c(2,2)) 
N=200000; n = 4299; ysT=2544; K=2000 
set.seed(531); res=MH2(J=K, n=4299, ysT=2544, alp=1, bet=1, 
p0=0.5, phi0=0.1, lam0=1, psd=0.008, phisd=0.0007, lamsd=0.04, 
eta=etanew, tau=taunew, del=1,gam=1, c=0.2, N=N) 
c(res$par, res$phiar,res$lamar) # 0.6580 0.4135 0.6045 OK 
plot(res$pv); plot(res$phiv); plot(res$lamv) # Has burnt in OK 
p0=res$pv[2001]; lam0=res$lamv[2001]; phi0=res$phiv[2001] 

# record last values 
set.seed(131); K=10000; par(mfrow=c(2,2)); date() # 
res=MH2(J=K, n=4299, ysT=2544, alp=1, bet=1, 
p0=p0, phi0=phi0, lam0=lam0, psd=0.008, phisd=0.0006, 
lamsd=0.04, 
eta=etanew, tau=taunew, del=1,gam=1, c=0.2, N=N ); date() # 
c(res$par, res$phiar,res$lamar) # 0.6825 0.4315 0.6643 OK 
plot(res$pv); plot(res$phiv); plot(res$lamv) # OK 


skip=1; inc=1+seq(skip,K,skip); J=length(inc); J 
# 10000 (Just use whole sample) 
pv=res$pv[inc]; phiv=res$phiv[inc]; lamv=res$lamv[inc] 
par(mfrow=c(2,2)); hist(pv,prob=T); hist(phiv,prob=T); hist(lamv,prob=T); 
# OK 


# Calculate estimates 

phat=mean(pv); pcpdr=quantile(pv,c(0.025,0.975)); pse=sd(pv) 
lamhat=mean(lamv); lamcpdr=quantile(lamv,c(0.025,0.975)); lamse=sd(lamv) 
phihat=mean(phiv); phicpdr=quantile(phiv,c(0.025,0.975)); phise=sd(phiv) 
# (the phi "se" value recorded below equals the lam one, i.e. it came from sd(lamv)) 


# Generate sample from predictive dsn of finite population mean 
# (zv holds the binomial probabilities defined in (12.11), one per MCMC draw) 
zv=pv*(1-phiv*lamv)/( pv*(1-phiv*lamv) + (1-pv)*(1-phiv) ) 
set.seed(331); yrTv = rbinom(J, N-n, zv); ybarv=(1/N)*(ysT+yrTv) 
ybarhat=mean(ybarv); ybarcpdr=quantile(ybarv,c(0.025,0.975)); 
ybarse=sd(ybarv) 


# Print out inferences 
print(cbind(c(phihat, phise ,phicpdr), c(lamhat, lamse ,lamcpdr), 
c(phat, pse,pcpdr), c(ybarhat,ybarse, ybarcpdr)), digits=4) 


#       0.01567 1.8491 0.43973 0.43973 
#       0.08660 0.0866 0.01387 0.01382 
# 2.5%  0.01491 1.6844 0.41331 0.41346 
# 97.5% 0.01650 2.0278 0.46689 0.46673 


RBest=mean(zv); RBci=RBest+c(-1,1)*qnorm(0.975)*sd(zv)/sqrt(J) 
c(RBest,RBci) # 0.43639 0.43612 0.43667 
(1/N)*(ysT+(N-n)*RBest) # 0.43973 

(1/N)*(ysT+(N-n)*RBci) # 0.43946 0.44000 
Exercise A.1 Practice with the Metropolis algorithm 


(a) Sample a value m from the standard exponential distribution. Then randomly sample n = 100 values from the normal distribution with mean m and variance $v = m^2$. 


Then design and implement a Metropolis algorithm so as to obtain a random sample of size J = 1,000 from the posterior of m. 


Use this sample to perform Monte Carlo inference on m. Be sure to provide a 95% CI for the posterior mean of m, an estimate of the 95% central posterior density region for m, and an estimate of the entire marginal posterior density of m. 


Then predict c, the average of a future independent sample of size k = 10 from the normal distribution with the same mean m and variance v. 


Be sure to provide a 95% CI for the predictive mean of c, an estimate of the 95% central predictive density region for c, and an estimate of the entire posterior predictive density of c. 


Illustrate your results with suitable figures (for example, trace plots and histograms). 


(b) Consider the following values in a sample obtained via SRSWOR from a finite population of size N = 50: 


3.4, 6.3, 1.0, 2.9, 1.8, 2.0, 0.5, 7.9, 4.8, 6.5. 


Suppose we model the finite population values as normal with (unknown) mean m and variance $v = m^2$, with a standard exponential prior on m. 


Using MCMC methods, estimate the finite population mean and provide a suitable 95% interval estimate. 
Solution to Exercise A.1 


(a) The sampled value of m was 0.7071. A histogram of the 100 sampled normal values is shown in Figure A.1(a) (page 612). This histogram is overlaid by the (known) normal distribution with mean m and variance $v = m^2 = 0.5$. 


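The posterior density of m is 

$$f(m \mid y) \propto f(m) f(y \mid m) \propto e^{-m} \prod_{i=1}^{n} \frac{1}{m} \exp\left\{-\frac{1}{2m^2}(y_i - m)^2\right\} = e^{-m} m^{-n} \exp\left\{-\frac{1}{2m^2}\sum_{i=1}^{n}(y_i - m)^2\right\}, \quad m > 0.$$

So the log-posterior (ignoring an additive constant) is 

$$l(m) = \log f(m \mid y) = -m - n\log m - \frac{1}{2m^2}\sum_{i=1}^{n}(y_i - m)^2.$$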
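
The log-posterior kernel can be sanity-checked against R's built-in density functions. The following is a minimal sketch with hypothetical function names (`logpost_kernel`, `logpost_full`) and a simulated data set used only for the check; the two versions should differ by the same constant at every value of m.

```r
# Log-posterior kernel as derived above: exponential(1) prior, N(m, m^2) data
logpost_kernel <- function(m, y) {
  -m - length(y) * log(m) - sum((y - m)^2) / (2 * m^2)
}

# The same quantity built from R's density functions (normalising constants included)
logpost_full <- function(m, y) {
  dexp(m, rate = 1, log = TRUE) + sum(dnorm(y, mean = m, sd = m, log = TRUE))
}

set.seed(1)                              # arbitrary seed; y is simulated for the check
y <- rnorm(100, mean = 0.7, sd = 0.7)
mgrid <- c(0.5, 0.7, 1.0)

d <- sapply(mgrid, function(m) logpost_full(m, y) - logpost_kernel(m, y))
d                                        # the same value for every m
-(length(y) / 2) * log(2 * pi)           # the dropped constant
```

Because the difference does not depend on m, it cancels in the Metropolis acceptance ratio, so only the kernel is needed.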


A suitable Metropolis algorithm is one which at each iteration proposes a value 
$$m' \sim U(m - \delta, m + \delta),$$
where $\delta$ is a tuning constant, and accepts this value with probability 
$$p = e^{q},$$
where 
$$q = l(m') - l(m).$$


Implementing this algorithm we obtained the 10,100 values of m, whose 
trace is shown in Figure A.1(b) (page 612). Stochastic convergence 
appears to have been attained immediately, and so the burn-in was 
conservatively taken to be 100. 


The last 10,000 of these 10,100 values are highly autocorrelated, as 
evidenced by the sample ACF in Figure A.1(c) (page 612). However, 
thinning out by a factor of 10 removes almost all of the autocorrelation, 
as seen in the sample ACF in Figure A.1(d) (page 612), and yields the 
required random sample 

$$m_1, \ldots, m_J \sim \text{iid } f(m \mid y),$$
where J = 1,000. 
A histogram of these 1,000 values of m is shown in Figure A.1(e). 


The dashed line in this subplot is a histogram estimate of $f(m \mid y)$, and the solid line is the true posterior density. The vertical lines show the posterior mean estimate, $\bar{m} = 0.7377$, the 95% CI for the posterior mean, (0.7350, 0.7404), and the 95% CPDR estimate for m, (0.6620, 0.8298). 


The dots show the true posterior mean, $\hat{m} = E(m \mid y) = 0.7393$, and the true 95% CPDR for m. The cross shows the true value of m, 0.7071. 


The Monte Carlo sample was used to generate a random sample from the predictive distribution of 
$$c = (y_{101} + \ldots + y_{110})/10$$
by sampling 
$$c_j \sim N(m_j, m_j^2/10), \quad j = 1,\ldots,J.$$
A histogram of these c-values is shown in Figure A.1(f). 


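The dashed line in this subplot is a histogram estimate of $f(c \mid y)$, and the solid line is the Rao-Blackwell estimate 

$$\bar{f}(c \mid y) = \frac{1}{J}\sum_{j=1}^{J} \frac{1}{(m_j/\sqrt{10})\sqrt{2\pi}} \exp\left\{-\frac{1}{2(m_j^2/10)}(c - m_j)^2\right\}.$$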


The vertical lines show the predictive mean estimate, $\bar{c} = 0.741$, the 95% CI for the predictive mean, (0.7270, 0.7549), and the 95% CPDR estimate for c, (0.3063, 1.1893). 


The dot shows the Rao-Blackwell estimate of $\hat{c} = E(c \mid y)$, which is the same as $\bar{m} = 0.7377$. 


The Rao-Blackwell 95% CI for $\hat{c}$ is the same as the 95% CI (0.7350, 0.7404) reported earlier. 
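
The reason the Rao-Blackwell estimate of the predictive mean of c coincides with the posterior mean estimate of m is the law of iterated expectations, together with the fact that c is an average of future values which each have mean m given m. Written out as a short derivation (the notation follows the solution above):

```latex
\hat{c} = E(c \mid y)
        = E\{\, E(c \mid y, m) \mid y \,\}
        = E(m \mid y)
        = \hat{m}
        \approx \bar{m} = \frac{1}{J}\sum_{j=1}^{J} m_j = 0.7377 .
```

The same argument explains why the two 95% confidence intervals are identical.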



Figure A.1 Graphical results for part (a) 
(b) Here we repeat the procedure in part (a), but: 
* with n = 10 (rather than 100) 


* using the 10 given sample values, whose mean is 3.71 
(instead of the 100 generated values, as previously) 


* with $c = \frac{1}{40}(y_{11} + \ldots + y_{50})$ (instead of $c = \frac{1}{10}(y_{101} + \ldots + y_{110})$). 


Figure A.2 is an analogue of Figure A.1, except that subplot (a) does not have a normal density overlaid, and there is an extra subplot (g) that shows inference on the finite population mean, which may be denoted here by 

$$a = \frac{1}{50}(10 \times 3.71 + 40c).$$
Figure A.2 Graphical results for part (b) 
[Panel titles recovered from the figure: (a) Histogram of 10 y-values; (b) Trace of 10100 m-values; (e) Histogram of 1000 m-values; (f) Histogram of 1000 c-values] 
Some of the estimates and quantities shown in the last subplot (g) are as follows. The histogram estimate of a's predictive mean is $\bar{a} = 3.061$ with 95% CI (3.028, 3.094). The Rao-Blackwell estimate of a's predictive mean is $(10 \times 3.71 + 40\bar{m})/50 = 3.055$, with 95% CI (3.031, 3.078). The exact predictive mean of a is obtained in the same way from the exact posterior mean of m, $\hat{m} = 2.907$, and is equal to $(10 \times 3.71 + 40\hat{m})/50 = 3.068$. The 95% CPDR estimate for a is (2.190, 4.256). 
R Code for Exercise A.1 


# (a) 

options(digits=4) 

INTEG <- function(xvec, yvec, a = min(xvec), b = max(xvec)){ 

# Integrates numerically under a spline through the points given by 
# the vectors xvec and yvec, from a to b. 

fit <- smooth.spline(xvec, yvec) 

spline.f <- function(x){predict(fit, x)$y } 

integrate(spline.f, a, b)$value } 

INTEG(seq(0,1,0.01), seq(0,1,0.01)^2, 0,1) #0.3333 correct 


X11(w=8,h=6); par(mfrow=c(2,2)); 
set.seed(221); m=rgamma(1,1,1); v=m^2; n=100; y=rnorm(n,m,m); c(m,v) 
# 0.7071 0.5000 
hist(y, prob=T,xlim=c(-2,4),ylim=c(0,0.8), breaks=seq(-2,4,0.25), 
main="(a) Histogram of 100 y-values") 
yvec=seq(-2,4,0.01); lines(yvec,dnorm(yvec,m,m),lwd=3) 
abline(v=c(m,m+c(-1,1)*qnorm(0.975)*m), lwd=3) 


LOGPOST=function(m=2,n=10,y=c(2,1)){ 
-m-n*log(m)-(1/(2*m^2))*sum((y-m)^2) } 
LOGPOST() #-9.056 OK 


METALG = function(J=1000,y,m0=1,mdel=0.4){ 
m=m0; mv=m; mct=0; n=length(y); for(j in 1:J){ 
mcand=runif(1,m-mdel,m+mdel) 
if(mcand>0){ logprob=LOGPOST(m= mcand,n=n,y=y)- 
LOGPOST(m=m,n=n,y=y) 
prob=exp(logprob) 
u=runif(1); if(u<=prob){ mct=mct+1; m= mcand } 
} 
mv=c(mv,m) 
} 


list(mv=mv,mar=mct/J) } 


set.seed(312); res=METALG(J=10100,y=y,m0=1,mdel=0.11); res$mar # 0.5528 
plot(res$mv,type="l",main="(b) Trace of 10100 m-values"); 


acf(res$mv, main="(c) Sample ACF of 10000 m-values") 
acf(res$mv, plot=F)[1:5] # 0.628 0.404 0.259 0.157 0.100 
mv=res$mv[-(1:101)][seq(10,10000,10)]; 

acf(mv, main="(d) Sample ACF of 1000 m-values") 
acf(mv, plot=F)[1:5] # -0.014 -0.001 0.006 0.018 0.014 
J=length(mv); J # 1000 


mbar=mean(mv); mci=mbar+c(-1,1)*qnorm(0.975)*sd(mv)/sqrt(J) 
mcpdr=quantile(mv,c(0.025,0.975)); 
mvec=seq(0.5,1,0.01); kvec=mvec; 

for(i in 1:length(mvec)) kvec[i] = exp(LOGPOST(m=mvec[i],n=n,y=y)) 
k0=INTEG(mvec,kvec); postvec=kvec/k0; k0 # 6.269e-11 
mhat=INTEG(mvec,mvec*postvec); 
c(mbar,sd(mv),mhat,mci,mcpdr) 

# 0.73769 0.04305 0.73935 0.73502 0.74036 0.66197 0.82984 


fun=function(q,p=0.025){ (INTEG(mvec,postvec,0,q)-p)^2 } 

LB0 = optim(par=0.5,fn=fun)$par; LB = optim(par= LB0,fn=fun)$par 
fun=function(q,p=0.975){ (INTEG(mvec,postvec,0,q)-p)^2 } 

UB0 = optim(par=0.8,fn=fun)$par; UB = optim(par= UB0,fn=fun)$par 
c(LB,UB) # 0.6609 0.8305 

INTEG(mvec,postvec,0,LB) # 0.025 

INTEG(mvec,postvec,UB,1) # 0.025 OK (Ignore all the warnings) 


par(mfrow=c(2,1)) 

hist(mv,prob=T,xlim=c(0.6,0.9),ylim=c(0,10), breaks=seq(0.5,1,0.01), 
xlab="x",main="(e) Histogram of 1000 m-values") 

lines(mvec,postvec,lty=1,lwd=3) 

lines(density(mv), lty=2,lwd=3) 

abline(v=c(mbar,mci,mcpdr),lwd=2) 

points(c(mhat,LB,UB),c(0,0,0), pch=16) 

points(m,0,pch=4,lwd=3) 


# Prediction of c ----------------------- 

set.seed(332); cv=rnorm(J,mv,mv/sqrt(10)) 

cbar=mean(cv); cci=cbar+c(-1,1)*qnorm(0.975)*sd(cv)/sqrt(J) 
ccpdr=quantile(cv,c(0.025,0.975)) 

c(cbar,sd(cv),cci,ccpdr) # 0.7410 0.2253 0.7270 0.7549 0.3063 1.1893 


hist(cv,prob=T,xlim=c(0,1.6),ylim=c(0,2.5), breaks=seq(0,1.6,0.05), 
xlab="c",main="(f) Histogram of 1000 c-values") 

cvec=seq(0,1.5,0.01); fcvec=seq(0,1.5,0.01); for(i in 1:length(cvec)) 
fcvec[i]=mean(dnorm(cvec[i],mv,mv/sqrt(10))) 

lines(cvec,fcvec,lty=1,lwd=3) 

lines(density(cv),lty=2,lwd=3) 

abline(v=c(cbar,cci,ccpdr),lwd=2) 

points(mhat,0,pch=16) 
# (b) 

X11(w=8,h=6); par(mfrow=c(2,2)); 

y = c(3.4, 6.3, 1.0, 2.9, 1.8, 2.0, 0.5, 7.9, 4.8, 6.5); n = 10; ybar=mean(y); 

ybar # 3.71 

hist(y, prob=T,xlim=c(0,10),ylim=c(0,0.6), breaks=seq(0,10,0.5), 
main="(a) Histogram of 10 y-values") 


set.seed(312); res=METALG(J=10100,y=y,m0=1,mdel=1); res$mar # 0.5954 
plot(res$mv,type="l",main="(b) Trace of 10100 m-values"); 

acf(res$mv, main="(c) Sample ACF of 10000 m-values") 
acf(res$mv,plot=F)[1:5] # 0.710 0.513 0.374 0.270 0.195 

mv=res$mv[-(1:101)][seq(10,10000,10)]; # thin first, so that (d) uses the new values 
acf(mv, main="(d) Sample ACF of 1000 m-values") 

acf(mv,plot=F)[1:5] # 0.056 0.001 -0.006 -0.027 0.035 

J=length(mv); J # 1000 


mbar=mean(mv); mci=mbar+c(-1,1)*qnorm(0.975)*sd(mv)/sqrt(J) 
mcpdr=quantile(mv,c(0.025,0.975)); 
mvec=seq(1.8,5,0.01); kvec=mvec; 

for(i in 1:length(mvec)) kvec[i] = exp(LOGPOST(m=mvec[i],n=n,y=y)) 
k0=INTEG(mvec,kvec); postvec=kvec/k0; k0 # 3.317e-08 
mhat=INTEG(mvec,mvec*postvec); 
c(mbar,sd(mv),mhat,mci,mcpdr) 

# 2.8907 0.4823 2.9071 2.8608 2.9206 2.1456 3.9827 


fun=function(q,p=0.025){ (INTEG(mvec,postvec,1.8,q)-p)^2 } 

LB0 = optim(par=2.1,fn=fun)$par; LB = optim(par= LB0,fn=fun)$par 
fun=function(q,p=0.975){ (INTEG(mvec,postvec,1.8,q)-p)^2 } 

UB0 = optim(par=4.1,fn=fun)$par; UB = optim(par= UB0,fn=fun)$par 
c(LB,UB) # 2.143 4.033 

INTEG(mvec,postvec,1.8,LB) # 0.025 

INTEG(mvec,postvec,UB,5) #0.025 OK (Ignore all the warnings) 


par(mfrow=c(2,1)) 

hist(mv,prob=T,xlim=c(1,5),ylim=c(0,1), breaks=seq(1,5,0.2), 
xlab="x",main="(e) Histogram of 1000 m-values") 

lines(mvec,postvec,lty=1,lwd=3) 

lines(density(mv), lty=2,lwd=3) 

abline(v=c(mbar,mci,mcpdr),lwd=2) 

points(c(mhat,LB,UB),c(0,0,0), pch=16) 

points(m,0,pch=4,lwd=3) 
# Prediction of c = (1/40)(y11+...+y50) (new definition) ----------------------- 
set.seed(332); cv=rnorm(J,mv,mv/sqrt(40)) 

cbar=mean(cv); cci=cbar+c(-1,1)*qnorm(0.975)*sd(cv)/sqrt(J) 
ccpdr=quantile(cv,c(0.025,0.975)) 

c(cbar,sd(cv),cci,ccpdr) # 2.8985 0.6594 2.8577 2.9394 1.8105 4.3925 


hist(cv,prob=T,xlim=c(1,6), ylim=c(0,0.7), breaks=seq(1,6,0.25), 
xlab="c",main="(f) Histogram of 1000 c-values") 

cvec=seq(1,6,0.01); fcvec=seq(1,6,0.01); for(i in 1:length(cvec)) 
fcvec[i]=mean(dnorm(cvec[i],mv,mv/sqrt(40))) 

lines(cvec,fcvec,lty=1,lwd=3) 

lines(density(cv),lty=2,lwd=3) 

abline(v=c(cbar,cci,ccpdr),lwd=2) 

points(mhat,0,pch=16) 


# Now perform inference on the finite population mean, 

# a=(1/50)*(10*ybar +40*c) 

av=(1/50)*(10*ybar+40*cv) 

abar=mean(av); aci=abar+c(-1,1)*qnorm(0.975)*sd(av)/sqrt(J) 
acpdr=quantile(av,c(0.025,0.975)) 

c(abar,sd(av),aci,acpdr) # 3.0608 0.5276 3.0281 3.0935 2.1904 4.2560 
(1/50)*(10*ybar+40*mbar) # 3.055 RB estimate of predictive mean of a 
(1/50)*(10*ybar+40*mci) # 3.031 3.078 RB CI for predictive mean of a 
(1/50)*(10*ybar+40*mhat) # 3.068 Exact predictive mean of a 


X11(w=8,h=4); par(mfrow=c(1,1)) 

hist(av, prob=T,xlim=c(1.5,5.5), ylim=c(0,1), breaks=seq(1,6,0.2), xlab="c", 
main="(g) Histogram of 1000 a-values (finite population mean)") 

avec=seq(1,6,0.01); favec=seq(1,6,0.01); for(i in 1:length(avec)) 
favec[i]= 
mean( dnorm( avec[i], (1/50)*( 10*ybar+40*mv), mv*sqrt(40)/50 ) ) 

lines(avec,favec,lty=1,lwd=3); lines(density(av),lty=2,lwd=3) 

abline(v=c(abar,aci,acpdr),lwd=2) 

points( (1/50)*(10*ybar+40*mbar) ,0.1,pch=1,cex=1, lwd=2) 

points( (1/50)*(10*ybar+40* mci) ,c(0.06,0.14), pch=1,cex=1, lwd=2) 

points( (1/50)*(10*ybar+40*mhat) ,0,pch=4,lwd=2,cex=2) 

points(ybar,0,cex=1,lwd=2,pch=16) 

legend(3.9,1, c("Histogram density estimate","Rao-Blackwell estimate"), 
lty=c(2,1), lwd=c(3,3), bg="white") 

legend(3.83,0.67,c("Sample mean","Rao-Blackwell estimate & 95% CI", 
"Exact predictive mean"), 
pch=c(16,1,4), pt.cex=c(1,1,2), pt.lwd=c(2,2,2), bg="white") 
Exercise A.2 Practice with the MH algorithm 


(a) Sample a value a from the standard exponential distribution and a value b from the uniform distribution between 0 and 10 (independently). 


Then randomly sample n = 100 values from the gamma distribution with mean $m = a/b$ and variance $v = a/b^2$. 


Then design and implement a Metropolis-Hastings algorithm so as to generate a random sample of size J = 1,000 from the joint posterior distribution of a and b. 


Use this sample to perform Monte Carlo inference on m. 


Be sure to provide a 95% CI for the posterior mean of m, an estimate of the 95% central posterior density region for m, and an estimate of the entire marginal posterior density of m. 


Then predict c, the average of a future independent sample of size k = 10 from the gamma distribution with the same mean m and variance v. 


Be sure to provide a 95% CI for the predictive mean of c, an estimate of the 95% central predictive density region for c, and an estimate of the entire posterior predictive density of c. 


Illustrate your results with suitable figures (e.g. trace plots and histograms). 


(b) Consider the following values in a sample obtained via SRSWOR from a finite population of size N = 30: 
0.4, 3.3, 1.0, 2.9, 1.8, 4.1. 


Suppose we model the finite population values as gamma with mean $m = a/b$ and variance $v = a/b^2$, with a standard exponential prior on m and a uniform prior on b between 0 and 10. 


Using MCMC methods, estimate/predict the finite population mean absolute deviation about the superpopulation mean, equivalently referred to as the MAD for short, and defined by 

$$\psi = \frac{1}{N}\sum_{i=1}^{N} |y_i - m|.$$
Solution to Exercise A.2 


The sampled values of a and b were 1.463 and 5.528. So the value of m 
was a/b = 0.2647. The 100 sampled gamma values are shown in Figure 
A.3(a) (page 621). 


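Next, the posterior density of the two parameters a and b is 

$$f(a,b \mid y) \propto f(a,b) f(y \mid a,b) \propto e^{-a} \prod_{i=1}^{n} \frac{b^{a} y_i^{a-1} e^{-b y_i}}{\Gamma(a)}, \quad a > 0,\ 0 < b < 10.$$

So the log-posterior (ignoring an additive constant) is 

$$l(a,b) = \log f(a,b \mid y) = -a + na\log b + (a-1)\sum_{i=1}^{n}\log y_i - b\sum_{i=1}^{n} y_i - n\log\Gamma(a).$$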
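
As a check on this log-posterior, the sketch below compares the kernel with the sum of the log prior and log likelihood computed from R's density functions. The function names `logpost_kernel` and `logpost_full` and the trial parameter values are hypothetical; the data are the six sample values from part (b) of the exercise.

```r
# Log-posterior kernel as derived above
logpost_kernel <- function(a, b, y) {
  n <- length(y)
  -a + n * a * log(b) + (a - 1) * sum(log(y)) - b * sum(y) - n * lgamma(a)
}

# The same quantity built from R's density functions
logpost_full <- function(a, b, y) {
  dexp(a, rate = 1, log = TRUE) +                       # exponential(1) prior on a
    dunif(b, min = 0, max = 10, log = TRUE) +           # uniform(0, 10) prior on b
    sum(dgamma(y, shape = a, rate = b, log = TRUE))     # gamma likelihood
}

y <- c(0.4, 3.3, 1.0, 2.9, 1.8, 4.1)
ab <- rbind(c(1, 1), c(2.5, 0.8), c(0.7, 4))            # trial (a, b) pairs

d <- apply(ab, 1, function(v) logpost_full(v[1], v[2], y) - logpost_kernel(v[1], v[2], y))
d            # the same value for every pair
log(1 / 10)  # the log of the uniform prior density, the only constant dropped
```

The difference is constant in (a, b), so it cancels in each acceptance ratio of the sampler.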
A suitable Metropolis algorithm is one which at each iteration: 


1. Proposes a value 
$$a' \sim U(a - \delta_a, a + \delta_a),$$
where $\delta_a$ is a tuning constant, and accepts this value with probability $p = e^{q}$, where $q = l(a',b) - l(a,b)$. 


2. Proposes a value 
$$b' \sim U(b - \delta_b, b + \delta_b),$$
where $\delta_b$ is a tuning constant, and accepts this value with probability $p = e^{q}$, where $q = l(a,b') - l(a,b)$. 


Implementing this algorithm we obtained the required J = 1,000 values 
$$(a_1,b_1),\ldots,(a_J,b_J) \sim \text{iid } f(a,b \mid y)$$
and hence 
$$m_1,\ldots,m_J \sim \text{iid } f(m \mid y)$$
by calculating $m_j = a_j/b_j$ for each j = 1,...,J. 


A histogram of these simulated m-values is shown in Figure A.3(b) (page 
622). 
The dashed line is a histogram estimate of $f(m \mid y)$. The vertical lines show the posterior mean estimate, $\bar{m} = 0.3017$, the 95% CI for the posterior mean, (0.3001, 0.3033), and the 95% CPDR estimate for m, (0.2566, 0.3570). The cross shows the true value of m, 0.2647. 


The Monte Carlo sample was then used to generate a random sample from the predictive distribution of 

$$c = (y_{101} + \ldots + y_{110})/10.$$

This was done by sampling 
$$y_{101}^{(j)},\ldots,y_{110}^{(j)} \sim \text{iid } G(a_j, b_j)$$
and forming 
$$c_j = \left(y_{101}^{(j)} + \ldots + y_{110}^{(j)}\right)/10, \quad j = 1,\ldots,J.$$


A histogram of the c-values is shown in Figure A.3(c). The dashed line is a histogram estimate of $f(c \mid y)$. The vertical lines are the predictive mean estimate, $\bar{c} = 0.2981$, the 95% CI for the predictive mean, (0.2929, 0.3033), and the 95% CPDR estimate for c, (0.1584, 0.4878). 


Figure A.3 Graphical results for part (a) 
[Panel titles recovered from the figure: (a) Histogram of 100 y-values; (b) Histogram of 1000 m-values] 
(b) Here we repeat the procedure in (a) but using n = 6 (rather than 100), and the 6 given sample values whose mean is 2.25 (instead of the 100 generated values as before), so as to generate a Monte Carlo sample of size J = 1,000 from the posterior distribution of a and b. 


We then use each pair of values, $a_j$ and $b_j$, to generate 24 values which are iid from the gamma distribution with parameters $a_j$ and $b_j$. 


Then for each j we calculate the associated value of the MAD, namely 

$$\psi_j = \frac{1}{30}\left(\sum_{i=1}^{6} |y_i - m_j| + \sum_{i=7}^{30} |y_i^{(j)} - m_j|\right), \quad m_j = a_j/b_j,$$

where $y_7^{(j)},\ldots,y_{30}^{(j)}$ are the 24 generated values. 


We then use the resulting J values of the MAD, i.e. $\psi_1,\ldots,\psi_J$, for Monte Carlo inference in the usual way. 
Figure A.4 shows a histogram of these J values and related information. 


Numerically, we estimate $\psi$'s posterior/predictive mean by 1.307 with 95% CI (1.27, 1.34), and we estimate $\psi$'s CPDR by (0.75, 2.73). 


Figure A.4 Histogram of 1,000 MAD values 
R Code for Exercise A.2 


# (a) 

options(digits=4); n = 100; X11(w=8,h=4); par(mfrow=c(1,1)); 
set.seed(192); a=rgamma(1,1,1); b=runif(1,0,10); y=rgamma(n,a,b); 
m=a/b; v=a/b^2; c(a,b,m,v) # 1.46321 5.52763 0.26471 0.04789 


hist(y,prob=T,xlim=c(0,1.5),ylim=c(0,3), breaks=seq(0,1.5,0.05), 
main="(a) Histogram of 100 y-values") 
yvec=seq(0,1.5,0.01); lines(yvec,dgamma(yvec,a,b),lwd=3) 
abline(v=m,lwd=3) 


sumlogy=sum(log(y)); sumy=sum(y) # sufficient statistics 
LOGPOST=function(a=1,b=1,n=3,sumlogy=2,sumy=2){ 
-a+n*a*log(b)+(a-1)*sumlogy-b*sumy-n*lgamma(a) } 
LOGPOST() #-3 OK 
# MHALG: two-step Metropolis sampler for (a, b) using uniform random-walk proposals 
MHALG = function(J=1000,y,a0=1,b0=1,adel=1,bdel=1){ 
a=a0; b=b0; av=a; bv=b; act=0; bct=0; n=length(y); 
sumlogy=sum(log(y)); sumy=sum(y) # sufficient statistics 
for(j in 1:J){ 
acand=runif(1,a-adel,a+adel) 
if(acand>0){ 
logprob= 
LOGPOST (a=acand,b=b,n=n,sumlogy=sumlogy,sumy=sumy)- 
LOGPOST (a=a,b=b,n=n,sumlogy=sumlogy,sumy=sumy) 
prob=exp(logprob) 
u=runif(1); if(u<=prob){ act=act+1; a= acand } } 
bcand=runif(1,b-bdel,b+bdel) 
if((bcand>0)&&(bcand<10)){ 
logprob= 
LOGPOST (a=a,b=bcand,n=n,sumlogy=sumlogy,sumy=sumy)- 
LOGPOST (a=a,b=b,n=n,sumlogy=sumlogy,sumy=sumy) 
prob=exp(logprob) 
u=runif(1); if(u<=prob){ bct=bct+1; b= bcand } 
} 
av=c(av,a); bv=c(bv,b) 
} 
list(av=av, bv=bv,aar=act/J,bar=bct/J) 


} 


set.seed(312); res=MHALG(J=10100,y=y,a0=1,b0=1,adel=0.3,bdel=1) 
X11(w=8,h=6); par(mfrow=c(2,1)); 
plot(res$av); plot(res$bv); c(res$aar,res$bar) # 0.5055 0.5611 


av=resSav[-(1:101)][seq(10,10000,10)]; J=length(av); J # 1000 
bv=resSbv[-(1:101)][seq(10,10000,10)]; mv=av/bv 
mbar=mean(mv); mci=mbar+c(-1,1)*qnorm(0.975)*sd(mv)/sqrt(J) 
mcpdr=quantile(mv,c(0.025,0.975)); 

c(mbar,mci,mcpdr) # 0.3017 0.3001 0.3033 0.2566 0.3570 


X11(w=8,h=4); par(mfrowzc(1,1)); 

hist(mv,prob=T,xlim=c(0.2,0.4),ylim=c(0,20), breaks=seq(0.2,0.4,0.005), 
xlabz"m",mainz"(b) Histogram of 1000 m-values") 

lines(density(mv),Ityz1,Iwd-3) 

abline(v=c(mbar,mci,mcpdr),lwd=2) 

points(m,0,pch=4,lwd=3) 




# Prediction of c -----------------------

set.seed(332); cv=rep(NA,J); for(j in 1:J) cv[j]=mean(rgamma(10,av[j], bv[j]))

cbar=mean(cv); cci=cbar+c(-1,1)*qnorm(0.975)*sd(cv)/sqrt(J)

ccpdr=quantile(cv,c(0.025,0.975))

c(cbar,sd(cv),cci,ccpdr) # 0.29812 0.08356 0.29294 0.30329 0.15843 0.48783

hist(cv, prob=T,xlim=c(0.05,0.7),ylim=c(0,7), breaks=seq(0,1.6,0.02),
   xlab="c",main="(c) Histogram of 1000 c-values")


lines(density(cv),lty=1,lwd=3); abline(v=c(cbar,cci, ccpdr),lwd=2)


# (b)

y=c( 0.4, 3.3, 1.0, 2.9, 1.8, 4.1); X11(w=8,h=6); par(mfrow=c(2,1));
n=length(y); sumlogy=sum(log(y)); sumy=sum(y) # sufficient statistics
set.seed(312); res=MHALG(J=10100,y=y,a0=1,b0=1,adel=1.3,bdel=0.7)
plot(res$av); plot(res$bv); c(res$aar,res$bar) # 0.5129 0.5094


av=res$av[-(1:101)][seq(10,10000,10)]; J=length(av); J # 1000
bv=res$bv[-(1:101)][seq(10,10000,10)]; mv=av/bv
mbar=mean(mv); mci=mbar+c(-1,1)*qnorm(0.975)*sd(mv)/sqrt(J)
mcpdr=quantile(mv,c(0.025,0.975));

c(mbar,mci,mcpdr) # 2.256 2.208 2.305 1.148 4.188


X11(w=8,h=4); par(mfrow=c(1,1));
hist(mv,prob=T,xlim=c(0,7),ylim=c(0,0.8), breaks=seq(0,10,0.5),
   xlab="x",main="Histogram of 1000 simulated m-values")
lines(density(mv),lty=2,lwd=3); abline(v=c(mbar,mci,mcpdr),lwd=2)


# Prediction of psi -----------------------
set.seed(332); psiv=rep(NA,J);
for(j in 1:J){ yrem=rgamma(24,av[j],bv[j])   # 24 simulated nonsample values
   yall = c(y,yrem); psiv[j]=mean(abs(yall-mv[j])) }
psibar=mean(psiv); psici =psibar+c(-1,1)*qnorm(0.975)*sd(psiv)/sqrt(J)
psicpdr=quantile(psiv,c(0.025,0.975))
c(psibar,sd(psiv),psici,psicpdr) # 1.3068 0.5411 1.2732 1.3403 0.7497 2.7349


hist(psiv, prob=T,xlim=c(0,4),ylim=c(0,1.5), breaks=seq(0,7,0.1),
   xlab="psi",main="")
lines(density(psiv), lty=1,lwd=3); abline(v=c(psibar,psici, psicpdr),lwd=2)
Exercise A.3 Practice with a Bayesian finite population
regression model


(a) Generate a population of covariates
$$x_1, \ldots, x_N \sim \text{iid } U(10, 20),$$
where N = 100.


Then generate a population of values
$$y_i \sim \perp N(a + b x_i, \sigma^2), \quad i = 1, \ldots, N,$$
where $a = 3$, $b = 0.5$, $\sigma = 2$.


Then select a random sample of size n = 20 from the N units in the finite
population, without replacement.


Plot the y values against the x values, over the population and over the
sample, respectively. Draw the true regression line $y = a + bx$ and the
two least squares regression lines estimated using the population data and
sample data, respectively.


(b) Consider the following Bayesian model:
$$(y_i \mid a, b, \lambda) \sim \perp N(a + b x_i, 1/\lambda), \quad i = 1, \ldots, N$$
$$f(a, b, \lambda) \propto 1/\lambda; \quad a, b \in \mathbb{R}; \quad \lambda > 0.$$


Generate a random sample of size J = 1,000 from the joint posterior
distribution of a, b and $\lambda$, given the sample data generated in (a).


Then use this sample and R to estimate each of the following quantities:

$m = a + 16b$ (average of a hypothetically infinite number of
values with covariate 16)

$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$ (the finite population mean)

$\psi = \dfrac{2 y_{(100)}}{y_{(50)} + y_{(51)}}$ (ratio of maximum to median of the 100 finite
population values).

Assume that all N covariate values in the population are known.

(c) Repeat the inferences in (b) but using WinBUGS and a sample size of
J = 10,000.
Solution to Exercise A.3


(a) The required plot and regression lines are shown in Figure A.5.


Figure A.5 Graphical results for part (a)
(Legend: true regression line; estimate from population; estimate from sample; estimate from nonsample.)


(b) Denote the sampled units by $s_1, \ldots, s_n \in \{1, \ldots, N\}$, where $s_1 < \ldots < s_n$,
and define $s = (s_1, \ldots, s_n)$.


Then define the population vector as $y = (y_1, \ldots, y_N)'$ and the sample
vector as $y_s = (y_{s_1}, \ldots, y_{s_n})'$.


Also define $r = (r_1, \ldots, r_{N-n}) = \{1, \ldots, N\} - s$ in such a way that $r_1 < \ldots < r_{N-n}$,
and define the nonsample vector as $y_r = (y_{r_1}, \ldots, y_{r_{N-n}})'$.


Likewise, define the population covariate vector as $x = (x_1, \ldots, x_N)'$, the
sample covariate vector as $x_s = (x_{s_1}, \ldots, x_{s_n})'$, and the nonsample covariate
vector as $x_r = (x_{r_1}, \ldots, x_{r_{N-n}})'$.
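Also consider all of $x_1, \ldots, x_N$ as known constants, and define $D = (s, y_s)$
as the data. Also let:
$$\beta = \begin{pmatrix} a \\ b \end{pmatrix}, \quad X_s = (1_n, x_s), \quad X_r = (1_{N-n}, x_r), \quad \Sigma_{ss} = I_n, \quad \Sigma_{rr} = I_{N-n}.$$


Then, from the theory of the normal-normal-gamma finite population
model, we have that:
$$(y_r \mid D, \beta, \lambda) \sim N_{N-n}(X_r \beta, \Sigma_{rr}/\lambda)$$
$$(\beta \mid D, \lambda) \sim N_2(T, D/\lambda),$$
where $D = (X_s' \Sigma_{ss}^{-1} X_s)^{-1}$ and $T = (X_s' \Sigma_{ss}^{-1} X_s)^{-1} X_s' \Sigma_{ss}^{-1} y_s$,
$$(\lambda \mid D) \sim G(A/2, B/2),$$
where $A = n - 2$ and $B = (y_s - X_s T)' \Sigma_{ss}^{-1} (y_s - X_s T)$.
(Here the symbol D denotes the data when it appears as a conditioning argument and the matrix defined above otherwise.)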
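As a quick sanity check on these formulas, note that with $\Sigma_{ss} = I_n$ the vector T is simply the least squares estimate of (a, b), and B/A is the usual residual variance estimate. The following R sketch illustrates this on freshly simulated data; the seed, the name `Tvec` and the data themselves are illustrative assumptions and are not the values used in the solution.

```r
# Illustrative check (not the author's data): T equals the OLS estimate when Sigma_ss = I
set.seed(1)
n  <- 20
xs <- runif(n, 10, 20)
ys <- rnorm(n, 3 + 0.5 * xs, 2)

Xs   <- cbind(1, xs)                      # design matrix X_s = (1_n, x_s)
D    <- solve(t(Xs) %*% Xs)               # (X_s' X_s)^{-1}
Tvec <- D %*% t(Xs) %*% ys                # posterior location of beta = (a, b)'
A    <- n - 2
B    <- sum((ys - Xs %*% Tvec)^2)         # residual sum of squares

fit <- lm(ys ~ xs)
cbind(Tvec, coef(fit))                    # the two columns should agree
c(B / A, summary(fit)$sigma^2)            # so should these two numbers
```

Since $(\lambda \mid D) \sim G(A/2, B/2)$ has mean A/B, the quantity B/A plays the role of a point estimate of $1/\lambda = \sigma^2$.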


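Thus, to do the required inference, first carry out the following steps:

1. Relabel the population units so that $y_s = (y_1, \ldots, y_n)'$,
   $x_s = (x_1, \ldots, x_n)'$, $y_r = (y_{n+1}, \ldots, y_N)'$, $x_r = (x_{n+1}, \ldots, x_N)'$, etc.,
   so that $y = (y_s', y_r')'$, etc.

2. Calculate A, B, D and T as per the above.

3. Generate $\lambda_1, \ldots, \lambda_J \sim \text{iid } G(A/2, B/2)$ (easy).

4. Generate $\beta^{(j)} = (a_j, b_j)' \sim \perp N_2(T, D/\lambda_j)$, for j = 1,...,J (easy).

5. Generate $y_r^{(j)} \sim \perp N_{N-n}(X_r \beta^{(j)}, \Sigma_{rr}/\lambda_j)$, for j = 1,...,J
   (e.g. for each j, generate $y_i^{(j)} \sim \perp N(a_j + b_j x_i, 1/\lambda_j)$,
   i = n+1,...,N, and form $y_r^{(j)} = (y_{n+1}^{(j)}, \ldots, y_N^{(j)})'$).

6. Form $y^{(j)} = (y_s', y_r^{(j)\prime})'$ for each j = 1,...,J.


Now calculate
$$m_j = a_j + 16 b_j$$
and perform Monte Carlo inference on m using the fact that
$$m_1, \ldots, m_J \sim \text{iid } f(m \mid D).$$


(For example, estimate m by $\bar{m} = J^{-1} \sum_{j=1}^{J} m_j$.)


Likewise, calculate $\bar{y}^{(j)} = \sum_{i=1}^{N} y_i^{(j)} / N$ and perform Monte Carlo inference
on $\bar{y}$ in the usual way, using the fact that $\bar{y}^{(1)}, \ldots, \bar{y}^{(J)} \sim \text{iid } f(\bar{y} \mid D)$.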
Finally, calculate
$$\psi_j = \frac{2 y_{(100)}^{(j)}}{y_{(50)}^{(j)} + y_{(51)}^{(j)}}$$
and perform Monte Carlo inference on $\psi$, using the fact that
$$\psi_1, \ldots, \psi_J \sim \text{iid } f(\psi \mid D).$$


Optionally, we may improve on some of the above ‘basic’ inferences by
considering Rao-Blackwell techniques, e.g. estimate m by its exact
posterior mean, $\hat{m} = E(m \mid D) = (1, 16)T$.


Figure A.6 shows histograms of the simulated values of m (subplot (a)),
$\bar{y}$ (subplot (b)) and $\psi$ (subplot (c)), with each subplot overlaid by
various point, interval and density estimates.


Subplot (d) (page 631) illustrates ‘exact’ inference on $\bar{y}$ based on the
theory of the normal-normal-gamma finite population model, and subplot
(e) (page 631) is a detail in subplot (d). Each plot features a cross showing
the true value of the quantity being estimated.


Figure A.6 Graphical results for part (b)
(Panels: (a) Histogram of 1000 m-values; (b) Histogram of 1000 ybar-values; (c) Histogram of 1000 psi-values; (d) Histogram of 1000 ybar-values.)

Table A.1 shows some of the true values and corresponding numerical
estimates featuring in Figure A.6.


(c) Using the WinBUGS code below we obtained results as shown in
Figure A.7. It will be noted that these are consistent with those in Table
A.1.

Table A.1 Numerical results for part (b)


Quantity    True      Posterior   MC          95% CI for          MC estimate
            value     mean        estimate    post. mean          of 95% CPDR
m           11.000    10.895      10.906      (10.875, 10.937)    (9.893, 11.863)
ybar        10.473    10.174      10.185      (10.158, 10.211)    (9.353, 11.049)
psi          1.435    NA           1.659      (1.650, 1.668)      (1.444, 2.014)


Figure A.7 Output from WinBUGS run


Node statistics
node    mean      sd        MC error    2.5%     median    97.5%    start   sample
a       -0.7159   2.805     0.02277     -6.23    -0.725    4.982    1001    10000
b       0.7253    0.184     0.001433    0.3562   0.7253    1.085    1001    10000
lam     0.2451    0.08178   9.318E-4    0.1133   0.2366    0.4316   1001    10000
m       10.89     0.5144    0.004931    9.868    10.88     11.92    1001    10000
psi     1.663     0.1468    0.001506    1.442    1.641     2.006    1001    10000
ybar    10.16     0.4326    0.004365    9.318    10.16     11.03    1001    10000

(Plots for the nodes a, b, m, ybar and psi, each based on a sample of 10000, are not reproduced here.)

R Code for Exercise A.3


# (a)

X11(w=8,h=5.5); par(mfrow=c(1,1)); options(digits=4)

N=100; n=20; a=3; b=0.5; sig=2; set.seed(312); x=runif(N,10,20);
y=rnorm(N,a+b*x,sig); s=sort(sample(1:N,n)); xs=x[s]; ys=y[s];
r=(1:N)[-s]; xr=x[r]; yr=y[r]; yT=sum(y); ysT=sum(ys); yrT=sum(yr)
ybar=mean(y); ysbar=mean(ys); yrbar=mean(yr);

xT=sum(x); xsT=sum(xs); xrT=sum(xr)

xbar=mean(x); xsbar=mean(xs); xrbar=mean(xr);


m=a+16*b; psi=max(y)/median(y)
c(m, ybar,max(y),median(y),psi) # 11.000 10.473 15.234 10.616 1.435


plot(x,y,xlim=c(0,20),ylim=c(0,17));

points(xs,ys,pch=16); abline(v=0,lty=3); abline(h=0,lty=3); abline(v=16,lty=3);

abline(h=a+16*b,lty=3);

abline(a,b,lwd=3);

abline(lm(y~x),lty=2,lwd=3); abline(lm(ys~xs),lty=3,lwd=3);

abline(lm(yr~xr),lty=4,lwd=3)

legend(0,17,bg="white", c("True regression line","Estimate from population",
   "Estimate from sample","Estimate from nonsample"),
   lty=1:4,lwd=rep(3,4) )

text(16,2,"The solid dots show the sample values")


# (b) Follows on from (a)....
# Load the MASS package (for mvrnorm, used further down)
library(MASS)


eta=0; tau=0; sigma=diag(rep(1,N)); sigmass=diag(rep(1,n));
sigmarr=diag(rep(1,N-n));

p=2; c=2*eta+n-p; Xs=cbind(1,xs); Xr=cbind(1,xr); X=rbind(Xs,Xr)
D=solve(t(Xs)%*%solve(sigmass)%*%Xs)

T=D%*%t(Xs)%*%solve(sigmass)%*%ys; t(T) # -0.6637 0.7224

A=2*eta+n-p; B=2*tau+ t(ys-Xs%*%T) %*% solve(sigmass) %*% (ys-Xs%*%T)


# Sample lambda, then beta given lambda, then the nonsample values given both
J=1000; set.seed(5); lamvec=rgamma(J,A/2,B/2);
betamat=matrix(NA,nrow=2,ncol=J)
for(j in 1:J) betamat[,j] = mvrnorm( n=1, mu=T, Sigma=D/lamvec[j] )


avec=betamat[1,]; bvec=betamat[2,]
ahat=mean(avec); bhat=mean(bvec); c(ahat,bhat) # -0.5742 0.7175
yrmat=matrix(NA,nrow=N-n,ncol=J)
set.seed(334); for(j in 1:J)
   yrmat[,j]=rnorm(N-n,avec[j]+bvec[j]*xr,1/sqrt(lamvec[j]))
# Use simulated values of beta and yr to do inference
mvec=avec+16*bvec; ybarvec=rep(NA,J); psivec=rep(NA,J)
for(j in 1:J){ ysim = c(ys, yrmat[,j])
   ybarvec[j]=mean(ysim)
   psivec[j] = max(ysim)/median(ysim) }
mhat=mean(mvec); mci= mhat +c(-1,1)*qnorm(0.975)*sd(mvec)/sqrt(J)
mcpdr=quantile(mvec,c(0.025,0.975))
ybarhat=mean(ybarvec);
ybarci = ybarhat +c(-1,1)*qnorm(0.975)*sd(ybarvec)/sqrt(J)
ybarcpdr=quantile(ybarvec,c(0.025,0.975))
psihat=mean(psivec); psici = psihat +c(-1,1)*qnorm(0.975)*sd(psivec)/sqrt(J)
psicpdr=quantile(psivec,c(0.025,0.975))


hist(mvec,prob=T,xlim=c(8,14),ylim=c(0,1), breaks=seq(7,14,0.25),
   xlab="m",main="(a) Histogram of 1000 m-values") # Ignore warnings
lines(density(mvec),lty=2,lwd=3) # Histogram estimate
abline(v=c(mhat,mci,mcpdr),lty=2,lwd=3) # Histogram estimates
mhat2=c(1,16)%*%T; points(mhat2,0, pch=16,cex=1.5) # Exact posterior mean
mvarterm2=c(1,16)%*%D%*%c(1,16); msdterm2=sqrt(mvarterm2)
mv=seq(6,16,0.05); fmv2=mv
for(k in 1:length(mv))
   fmv2[k]=mean(dnorm(mv[k],mhat2,msdterm2/sqrt(lamvec)))
lines(mv,fmv2,lwd=3); # Exact posterior density of m
points(m,0, pch=4,cex=2,lwd=3 ) # True value of m
legend(8,1,c("Histogram estimate","Exact density"), lty=c(2,1),lwd=c(3,3),
   bg="white")
legend(8,0.6,c("Rao-Blackwell","True"), pch=c(16,4),
   pt.cex=c(1.5,2), pt.lwd=c(1,3), bg="white")


hist(ybarvec, prob=T,xlim=c(8,12),ylim=c(0,1), breaks=seq(3,18,0.25),
   xlab="ybar",main="(b) Histogram of 1000 ybar-values")
lines(density(ybarvec),lty=2,lwd=3) # Histogram estimate
abline(v=c(ybarhat, ybarci, ybarcpdr),lty=2,lwd=3) # Histogram estimates
ybarv=seq(8,13,0.02); fybarhatv=ybarv;
meanvalvec = (1/N)*( ysT+(N-n)*(avec+bvec*xrbar) )
varvalvec = (N-n)/(lamvec*N^2)
for(k in 1:length(ybarv)){
   fybarhatv[k]= mean( dnorm(ybarv[k], meanvalvec, sqrt(varvalvec))) }
lines(ybarv, fybarhatv,lty=1,lwd=3) # Rao-Blackwell
points(mean(meanvalvec),0,pch=16,cex=1.5) # Rao-Blackwell
points(ybar, 0, pch=4,cex=2,lwd=3 ) # True value of ybar


legend(8,1,c("Histogram estimate","Rao-Blackwell"),
   lty=c(2,1),lwd=c(3,3), bg="white")
legend(8,0.6,c("Rao-Blackwell","True value"), pch=c(16,4),
   pt.cex=c(1.5,2), pt.lwd=c(1,3), bg="white")


hist(psivec, prob=T,xlim=c(1.25,2.5),ylim=c(0, 3), breaks=seq(0,10,0.05),
   xlab="psi",main="(c) Histogram of 1000 psi-values")

den=density(psivec); lines(den, lty=2,lwd=3)

abline(v=c(psihat, psici, psicpdr),lty=2,lwd=3)


# psimode=den$x[(1:length(den$x))[den$y==max(den$y)]] # optional extras....
# psimedian=median(psivec); abline(v=c(psimode,psimedian), lty=1,lwd=3)


points(psi, 0, pch=4,cex=2,lwd=3 ) # True value of psi
legend(2.05,3,c("Histogram estimate"), lty=c(2),lwd=c(3), bg="white")
legend(2.05,2,c("True value"),pch=c(4), pt.cex=c(2), pt.lwd=c(3), bg="white")


# Perform exact inference on ybar using a function from a previous exercise:
NNGFPM= function(eta=0, tau=0, alp=0.05,
   ys= c(5.6,2.3,8.4,5.1,4.3), X=rep(1,15), N=15, sigma=diag(rep(1,N)) ) {
# This function performs inference under the normal-normal-gamma
# finite population model.
# Inputs: eta, tau, alp, ys, X, N, sigma
# Outputs: A list with $a, $b and $c indicating (ybar-a)/b given ys ~ t(c)
p=ncol(cbind(NA,X))-1; n=length(ys); c=2*eta+n-p
ysT=sum(ys); Xs=cbind(NA,X)[1:n,][,-1]; Xr=cbind(NA,X)[(n+1):N,][,-1]
sigmass=sigma[1:n,1:n]; sigmarr=sigma[(n+1):N,(n+1):N]
sigmasr=sigma[1:n,(n+1):N]; sigmars=t(sigmasr)
D=solve(t(Xs)%*%solve(sigmass)%*%Xs)
beta=D%*%t(Xs)%*%solve(sigmass)%*%ys
A=Xr-sigmars%*%solve(sigmass)%*%Xs; oner=rep(1,N-n)
a=(1/N)*( ysT + t(oner)%*%
   ( Xr%*%beta + sigmars%*%solve(sigmass)%*%(ys-Xs%*%beta) ) )
b2=(1/(c*N^2)) * ( 2*tau + t(ys-Xs%*%beta)%*%solve(sigmass)%*%
   (ys-Xs%*%beta) ) * t(oner)%*%
   (sigmarr-sigmars%*%solve(sigmass)%*%sigmasr + A%*%D%*%t(A)) %*%
   oner
b=sqrt(b2); cpdr=a+c(-1,1)*qt(1-alp/2,c)*b
list(a=a,b=b,c=c,beta=beta, cpdr=cpdr) }


res= NNGFPM( eta=0, tau=0, alp=0.05, ys=ys,X=X,N=N, sigma=sigma)
c(res$a,res$b,res$c, res$cpdr) # 10.1744 0.4035 18.0000 9.3267 11.0221


# Plot for inference on ybar again
hist(ybarvec,prob=T,xlim=c(8,12),ylim=c(0,1), breaks=seq(3,18,0.2),
   xlab="ybar",main="(d) Histogram of 1000 ybar-values")
abline(v=c(ybarhat, ybarci, ybarcpdr),lty=2,lwd=3) # Histogram point estimates
points(mean(meanvalvec),0,pch=16,cex=1.5)
   # Rao-Blackwell estimate of predictive mean
abline(v=c(res$a,res$cpdr), lty=1, lwd=3) # Exact point estimates
points(ybar, 0, pch=4,cex=2,lwd=3 ) # True value of ybar
lines(density(ybarvec),lty=2,lwd=3) # Histogram estimate of predictive pdf
lines(ybarv, fybarhatv,lty=3,lwd=3) # Rao-Blackwell estimate of pdf
lines(ybarv, dt((ybarv-res$a)/res$b,c)/res$b,lty=1,lwd=3) # Exact predictive pdf
legend(8,1,c("Histogram","Rao-Blackwell","Exact pdf"),
   lty=c(2,3,1),lwd=c(3,3,3))
legend(8,0.5,c("Rao-Blackwell","True value"),
   pch=c(16,4), pt.cex=c(1.5,2), pt.lwd=c(1,3))
text(11.65,0.8,
   "The solid vertical lines\nshow the exact \npredictive mean\nand 95% CPDR")


# Detail in last figure
hist(ybarvec, prob=T,xlim=c(10,11.5),ylim=c(0,1), breaks=seq(3,18,0.2),
   xlab="ybar",main="(e) Detail in subplot (d)")
abline(v=c(ybarhat, ybarci, ybarcpdr),lty=2,lwd=3) # Histogram point estimates
points(mean(meanvalvec),0,pch=16,cex=1.5)
   # Rao-Blackwell estimate of predictive mean
abline(v=c(res$a,res$cpdr), lty=1, lwd=3) # Exact point estimates
points(ybar, 0, pch=4,cex=2,lwd=3 ) # True value of ybar
lines(density(ybarvec),lty=2,lwd=3) # Histogram estimate of predictive pdf
lines(ybarv, fybarhatv,lty=3,lwd=3) # Rao-Blackwell estimate of pdf
lines(ybarv, dt((ybarv-res$a)/res$b,c)/res$b,lty=1,lwd=3) # Exact predictive pdf
legend(11.1,1,c("Histogram","Rao-Blackwell",
   "Exact pdf"),lty=c(2,3,1),lwd=c(3,3,3))
legend(11.1,0.6,c("Rao-Blackwell","True value"),
   pch=c(16,4), pt.cex=c(1.5,2), pt.lwd=c(1,3))


# Exact values of the quantities of interest and summary estimates ------------
c(m,mhat2,mhat,mci,mcpdr)
   # 11.000 10.895 10.906 10.875 10.937 9.893 11.863
c(ybar,res$a,ybarhat,ybarci,ybarcpdr)
   # 10.473 10.174 10.185 10.158 10.211 9.353 11.049
c(psi,psihat,psici,psicpdr) # 1.435 1.659 1.650 1.668 1.444 2.014


# Preparation of data for input to WinBUGS ----------------------------------------
paste(as.character(round(ys,2)), collapse=",")
   # "14.98,10.99,9.58,6.56,13.83,.....,10.66,10.41"
paste(as.character(round(c(xs,xr),2)), collapse=",")
   # "19.34,18.2,14.27,10.91,13.45,.....,12.57,10.36,19.49"
WinBUGS Code for Exercise A.3


model
{
for(i in 1:100){
mu[i] <- a + b*x[i]
y[i] ~ dnorm(mu[i],lam)
}
a ~ dnorm(0.0,0.0001)
b ~ dnorm(0.0,0.0001)
lam ~ dgamma(0.0001,0.0001)
m <- a + 16*b
ybar <- mean(y[])
max <- ranked(y[],100)
medL <- ranked(y[],50)
medU <- ranked(y[],51)
med <- (medL + medU)/2
psi <- max/med
}


# data
list(y=c(
14.98,10.99,9.58,6.56,13.83, 11.38,9.13,13.25,7.03,11.14,
2.74,11.97,12.15,9.39,11.71, 10.25,7.98,8.54,10.66,10.41,
NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA,
NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA,
NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA,
NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA, NA,NA,NA,NA,NA),


x=c(19.34,18.2,14.27,10.91,13.45,13.3,11.31,16.62,13.07,17.45,10.55,
17.66,17.34,17.46,16.14,17.19,10.96,14.19,16.08,14.83,17.92,16.61,
14.52,16.7,12.28,14.61,14.51,11.5,15.17,16.72,11.27,15.21,16.34,
10.36,12.62,19.27,19.7,12.26,10.07,18.74,11.86,12.35,16.79,13.18,
14.05,17.52,18.17,18.7,18.1,10.17,10.26,12.95,12.64,12.35,18.39,
12.08,17.48,13.47,14.47,16.76,17.64,14.32,19.07,17.29,15.87,14.2,
18.49,14.69,13.57,14.74,12.41,19.99,18.39,16.43,15.6,15.74,18.33,
16.98,16.72,19.3,13.92,11.4,11.55,13.83,12.36,13.3,15.3,19.26,18.15,
17.75,10.72,13.78,13.2,14.98,13.53,10.19,16.46,12.57,10.36,19.49))


# inits
list(a=0,b=0,lam=1)
Exercise A.4 Case study in Bayesian finite population models
with biased sampling


A finite population of size N = 4 consists of values $y_1, \ldots, y_N$ that are iid
Bernoulli with parameter $\theta$.


A priori, $\theta$ is equally likely to be 1/4 or 3/4 (with no other possibilities).


We are interested in two quantities:

the superpopulation mean $\theta = E(y_i \mid \theta)$

the finite population mean $\bar{y} = \dfrac{y_1 + \ldots + y_N}{N}$.


We sample n = 2 units from the finite population without replacement in
such a way that every sample is equally likely to be selected, apart from
one exception, as follows:

if the value of unit 1 is 1 then each sample with unit 1 is twice as likely to
be selected as each sample without unit 1.


We observe the values of the two sampled units (each being 0 or 1) as well
as the labels identifying them (each being 1, 2, 3 or 4).


(a) Write down a suitable Bayesian model for the above scenario in terms
of the densities of the parameter $\theta$, the finite population vector,
$y = (y_1, \ldots, y_N)$, and the sample, $s = (s_1, \ldots, s_n)$.


Your formulae may involve only these variables, as well as n, N, and the
vector of inclusion counters, $I = (I_1, \ldots, I_N)$, where $I_i = 1$ if the ith unit is
in the sample, and $I_i = 0$ otherwise. (Note that there is a one-to-one
correspondence between s and I in this exercise.)


(b) Identify a condition which determines whether the sampling
mechanism is ignorable or nonignorable. Then write down an expression
for the density of s in each of these two cases.




(c) Derive the posterior density and mean of $\theta$ generally.

(d) Find the model bias of the posterior mean of $\theta$ if:
   (i) $\theta = 1/4$ and $s = (1,3)$
   (ii) $\theta = 1/4$ and $s = (2,3)$.

(e) Find the design bias of the posterior mean of $\theta$ if:
   (i) $\theta = 1/4$ and $y = (0,0,1,1)$
   (ii) $\theta = 1/4$ and $y = (1,0,1,1)$.

(f) Derive the predictive mean of $\bar{y}$ generally.

(g) Find the model bias of the predictive mean of $\bar{y}$ if:
   (i) $\theta = 1/4$ and $s = (1,3)$
   (ii) $\theta = 1/4$ and $s = (2,3)$.

(h) Find the design bias of the predictive mean of $\bar{y}$ if:
   (i) $\theta = 1/4$ and $y = (0,0,1,1)$
   (ii) $\theta = 1/4$ and $y = (1,0,1,1)$.


(i) Design and run a Gibbs sampler to check the posterior mean of $\theta$
in (c) and the predictive mean of $\bar{y}$ in (f).


(j) Use Monte Carlo methods to check the two design biases in (h).


(k) Find the mean of the predictive mean of the finite population mean.
Then apply Monte Carlo methods to check your answer.
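Solution to Exercise A.4


(a) Part of the Bayesian model is:
$$f(y \mid \theta) = \prod_{i=1}^{N} \theta^{y_i} (1-\theta)^{1-y_i}$$
$$f(\theta) = 1/2, \quad \theta = 1/4, 3/4.$$

As regards the sampling mechanism, if $y_1 = 0$ then
$$f(s \mid y, \theta) = f(s) = \binom{N}{n}^{-1} = \frac{1}{6}, \quad s = (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).$$


Also, if $y_1 = 1$ then
$$f(s \mid y, \theta) = f(s \mid y_1) = \begin{cases} 2c, & 1 \in s \\ c, & 1 \notin s \end{cases} = \begin{cases} 2c, & s = (1,2), (1,3), (1,4) \\ c, & s = (2,3), (2,4), (3,4). \end{cases}$$


To find the value of c, we may equate
$$1 = \sum_s f(s \mid y) = c \times 3 + (2c) \times 3 = 9c.$$


We thereby obtain c = 1/9.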


Note 1: Alternatively, we may observe that
$$f(s \mid y) = c(1 + I_1),$$
where $I_1 = I(1 \in s)$. Hence
$$1 = \sum_s f(s \mid y) = c \sum_s (1 + I_1) = c\left\{ \binom{4}{2} + \binom{3}{1} \right\} = c(6 + 3) = 9c.$$

Note 2: There are a total of $\binom{N-1}{n-1}$ samples s which contain any given
particular unit i. So if $y_1 = 1$ then
$$f(s \mid y) = f(s \mid y_1) = \frac{1 + I_1}{\binom{N}{n} + \binom{N-1}{n-1}}.$$


Putting together the two cases above ($y_1 = 0$ and 1), we see that the
sampling mechanism is given generally by
$$f(s \mid y, \theta) = f(s \mid y_1) = \frac{1 + I_1 y_1}{\binom{N}{n} + \binom{N-1}{n-1} y_1} = \frac{1 + I_1 y_1}{6 + 3 y_1},$$
where of course $I_1 = I(s \in \{(1,2), (1,3), (1,4)\})$.


As a check, it is useful to list all of the values produced by this formula.
These values are as shown in Table A.2. Observe that the sum of $f(s \mid y_1)$
over all values of s is equal to 1, both when $y_1 = 0$ and when $y_1 = 1$.


From Table A.2 we may also confirm that, as specified in the problem:
every sample is equally likely to be selected, apart from one exception, as
follows: if the value of unit 1 is 1 then each sample with unit 1 is twice as
likely to be selected as each sample without unit 1.


Table A.2 All possible values of s and their probabilities


Sample, s:            (1,2)   (1,3)   (1,4)   (2,3)   (2,4)   (3,4)
I_1 = I(1 in s):        1       1       1       0       0       0

f(s | y_1 = 0):        1/6     1/6     1/6     1/6     1/6     1/6
f(s | y_1 = 1):        2/9     2/9     2/9     1/9     1/9     1/9
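
The entries of Table A.2 can be reproduced with a few lines of R. This is only an illustrative sketch: the names `samples`, `I1`, `fs` and `tab` are not from the solution, and the formula coded is the general expression $(1 + I_1 y_1)/\{\binom{N}{n} + \binom{N-1}{n-1} y_1\}$ stated above.

```r
# Sampling probabilities f(s | y_1) for N = 4, n = 2 (illustrative names)
N <- 4; n <- 2
samples <- t(combn(N, n))                  # the 6 possible samples, one per row
I1 <- as.numeric(samples[, 1] == 1)        # 1 if unit 1 is in the sample, else 0

fs <- function(y1) (1 + I1 * y1) / (choose(N, n) + choose(N - 1, n - 1) * y1)

tab <- rbind(I1 = I1, f_y1_0 = fs(0), f_y1_1 = fs(1))
colnames(tab) <- apply(samples, 1, paste, collapse = ",")
print(tab)
c(sum(fs(0)), sum(fs(1)))                  # each row of probabilities should sum to 1
```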
(b) If unit 1 is selected ($1 \in s$, $I_1 = 1$) then $y_1 = 0$ or 1 is known and so
the sampling mechanism is ignorable. In that case,
$$f(s \mid y, \theta) = \frac{1 + y_1}{6 + 3 y_1} = \begin{cases} 1/6, & y_1 = 0 \\ 2/9, & y_1 = 1 \end{cases} = \begin{cases} 3/18, & y_1 = 0 \\ 4/18, & y_1 = 1 \end{cases} = \frac{3 + y_1}{18},$$
$$s = (1,2), (1,3), (1,4).$$


Conversely, if unit 1 is not selected ($1 \notin s$, $I_1 = 0$) then $y_1 = 0$ or 1 is
unknown and so the sampling mechanism is nonignorable.


In that case:
$$f(s \mid y, \theta) = \frac{1}{6 + 3 y_1} = \begin{cases} 1/6, & y_1 = 0 \\ 1/9, & y_1 = 1 \end{cases} = \begin{cases} 3/18, & y_1 = 0 \\ 2/18, & y_1 = 1 \end{cases} = \frac{3 - y_1}{18},$$
$$s = (2,3), (2,4), (3,4).$$


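(c) The posterior distribution of $\theta$ given data $D = (s, y_s)$ can now be
derived by considering the two cases in (b) above.


First, if unit 1 happens to be sampled then the value of the sampling
density $f(s \mid y, \theta)$ is known, and so the sampling mechanism is ignorable.


Explicitly, we find in that case,
$$\begin{aligned}
f(\theta \mid D) = f(\theta \mid s, y_s) &\propto f(\theta, s, y_s) = \sum_{y_r} f(\theta, s, y_s, y_r) \\
&= \sum_{y_r} f(\theta) f(y_s, y_r \mid \theta) f(s \mid y_s, y_r, \theta) \\
&= \sum_{y_r} f(\theta) f(y_s \mid \theta) f(y_r \mid \theta) f(s \mid y_1) \\
&= f(\theta) f(y_s \mid \theta) f(s \mid y_1) \sum_{y_r} f(y_r \mid \theta)
\end{aligned}$$
since $f(s \mid y, \theta) = f(s \mid y_1)$, where s is fixed at its
observed value, $s = (s_1, s_2) = (1,2)$, (1,3) or (1,4), and $y_1$ is part of $y_s$
$$\propto f(\theta) f(y_s \mid \theta) \times 1 \times 1$$
since $f(s \mid y_1)$ does not depend on $\theta$.


Note: This is the point at which $f(s \mid y, \theta)$ can be ‘ignored’.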
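Thus we have that
$$\begin{aligned}
f(\theta \mid D) &\propto 1 \times \prod_{i \in s} \theta^{y_i} (1-\theta)^{1-y_i} = \theta^{y_{sT}} (1-\theta)^{2 - y_{sT}} \\
&= \begin{cases} (1/4)^{y_{sT}} (3/4)^{2-y_{sT}}, & \theta = 1/4 \\ (3/4)^{y_{sT}} (1/4)^{2-y_{sT}}, & \theta = 3/4 \end{cases} \\
&\propto \begin{cases} 3^{2-y_{sT}}, & \theta = 1/4 \\ 3^{y_{sT}}, & \theta = 3/4 \end{cases}
\;\propto\; \begin{cases} 9, & \theta = 1/4 \\ 9^{y_{sT}}, & \theta = 3/4, \end{cases}
\end{aligned}$$
where $y_{sT}$ denotes the total of the two sampled values.


That is (if $1 \in s$),
$$f(\theta \mid D) = \begin{cases} \dfrac{9}{9 + 9^{y_{sT}}}, & \theta = 1/4 \\[2mm] \dfrac{9^{y_{sT}}}{9 + 9^{y_{sT}}}, & \theta = 3/4 \end{cases}
= \begin{cases}
9/10 \ (\theta = 1/4), \ \ 1/10 \ (\theta = 3/4), & y_{sT} = 0 \\
1/2 \ (\theta = 1/4), \ \ 1/2 \ (\theta = 3/4), & y_{sT} = 1 \\
1/10 \ (\theta = 1/4), \ \ 9/10 \ (\theta = 3/4), & y_{sT} = 2.
\end{cases}$$


So then also (if $1 \in s$) the posterior mean of $\theta$ is
$$\hat{\theta} = E(\theta \mid D) = \begin{cases}
\frac{1}{4}\left(\frac{9}{10}\right) + \frac{3}{4}\left(\frac{1}{10}\right) = \frac{3}{10}, & y_{sT} = 0 \\[1mm]
\frac{1}{4}\left(\frac{1}{2}\right) + \frac{3}{4}\left(\frac{1}{2}\right) = \frac{1}{2}, & y_{sT} = 1 \\[1mm]
\frac{1}{4}\left(\frac{1}{10}\right) + \frac{3}{4}\left(\frac{9}{10}\right) = \frac{7}{10}, & y_{sT} = 2.
\end{cases}$$


Note: This could also be written as $\hat{\theta} = \dfrac{3 + 2 y_{sT}}{10}$ (if $1 \in s$).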
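Next, suppose that unit 1 is not sampled. Then the value of unit 1 is
unknown and so the sampling mechanism is nonignorable.


In that case, we see from (b) that
$$f(s \mid y, \theta) = f(s \mid y_1) = \frac{3 - y_1}{18}, \quad s = (2,3), (2,4), (3,4),$$
where $y_1$ is an unknown value in the nonsample vector $y_r = (y_1, y_k)$,
where k = 2, 3 or 4.


Working through as before,
$$\begin{aligned}
f(\theta \mid D) = f(\theta \mid s, y_s) &\propto f(\theta, s, y_s) = \sum_{y_r} f(\theta, s, y_s, y_r) \\
&= \sum_{y_r} f(\theta) f(y_s, y_r \mid \theta) f(s \mid y_s, y_r, \theta) \\
&= \sum_{y_r} f(\theta) f(y_s \mid \theta) f(y_r \mid \theta) f(s \mid y_1) \\
&= f(\theta) f(y_s \mid \theta) \sum_{y_r} f(s \mid y_1) f(y_r \mid \theta) \\
&= f(\theta) f(y_s \mid \theta) q(\theta),
\end{aligned}$$
where
$$q(\theta) = \sum_{y_r} f(s \mid y_1) f(y_r \mid \theta) \propto \sum_{y_r} (3 - y_1) f(y_r \mid \theta) = E\{(3 - y_1) \mid \theta\} = 3 - \theta.$$


Note: We could also have written
$$\begin{aligned}
q(\theta) &\propto \sum_{y_1=0}^{1} \sum_{y_k=0}^{1} (3 - y_1) \left[ \theta^{y_1}(1-\theta)^{1-y_1} \, \theta^{y_k}(1-\theta)^{1-y_k} \right] \\
&= \left\{ \sum_{y_k=0}^{1} \theta^{y_k}(1-\theta)^{1-y_k} \right\} \left\{ \sum_{y_1=0}^{1} (3 - y_1) \theta^{y_1}(1-\theta)^{1-y_1} \right\} \\
&= 1 \times \left\{ (3-0)\theta^0(1-\theta)^1 + (3-1)\theta^1(1-\theta)^0 \right\} \\
&= 3(1-\theta) + 2\theta \\
&= 3 - \theta.
\end{aligned}$$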
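Having shown (in the case $1 \notin s$) that
$$f(\theta \mid D) \propto f(\theta) f(y_s \mid \theta)(3 - \theta),$$
it now follows that
$$\begin{aligned}
f(\theta \mid D) &\propto \begin{cases} (1/4)^{y_{sT}} (3/4)^{2-y_{sT}} (3 - 1/4), & \theta = 1/4 \\ (3/4)^{y_{sT}} (1/4)^{2-y_{sT}} (3 - 3/4), & \theta = 3/4 \end{cases} \\
&\propto \begin{cases} 3^{2-y_{sT}} \times 11, & \theta = 1/4 \\ 3^{y_{sT}} \times 9, & \theta = 3/4 \end{cases}
\;\propto\; \begin{cases} 11, & \theta = 1/4 \\ 9^{y_{sT}}, & \theta = 3/4. \end{cases}
\end{aligned}$$


Thus (if $1 \notin s$), we have that
$$f(\theta \mid D) = \begin{cases} \dfrac{11}{11 + 9^{y_{sT}}}, & \theta = 1/4 \\[2mm] \dfrac{9^{y_{sT}}}{11 + 9^{y_{sT}}}, & \theta = 3/4 \end{cases}
= \begin{cases}
11/12 \ (\theta = 1/4), \ \ 1/12 \ (\theta = 3/4), & y_{sT} = 0 \\
11/20 \ (\theta = 1/4), \ \ 9/20 \ (\theta = 3/4), & y_{sT} = 1 \\
11/92 \ (\theta = 1/4), \ \ 81/92 \ (\theta = 3/4), & y_{sT} = 2.
\end{cases}$$


So then also (if $1 \notin s$) the posterior mean of $\theta$ is
$$\hat{\theta} = E(\theta \mid D) = \begin{cases}
\frac{1}{4}\left(\frac{11}{12}\right) + \frac{3}{4}\left(\frac{1}{12}\right) = \frac{14}{48} = \frac{7}{24} = \frac{805}{2760} = 0.2917, & y_{sT} = 0 \\[1mm]
\frac{1}{4}\left(\frac{11}{20}\right) + \frac{3}{4}\left(\frac{9}{20}\right) = \frac{38}{80} = \frac{19}{40} = \frac{1311}{2760} = 0.4750, & y_{sT} = 1 \\[1mm]
\frac{1}{4}\left(\frac{11}{92}\right) + \frac{3}{4}\left(\frac{81}{92}\right) = \frac{254}{368} = \frac{127}{184} = \frac{1905}{2760} = 0.6902, & y_{sT} = 2.
\end{cases}$$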
Note: This mean may also be written as
$$\hat{\theta} = \frac{805 + 462 y_{sT} + 44 y_{sT}^2}{2760} \quad (\text{if } 1 \notin s).$$


This alternative formula was obtained by solving the equation
$$a + bx + cx^2 = \begin{cases} 805, & x = 0 \\ 1311, & x = 1 \\ 1905, & x = 2 \end{cases}$$
for a, b and c.


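Putting the two cases together we find that the posterior mean of $\theta$ is
given generally by:
$$\hat{\theta} = E(\theta \mid D) = \hat{\theta}(D) = \hat{\theta}(s, y_s) = \begin{cases}
3/10 = 0.3000 & \text{if } 1 \in s \text{ and } y_{sT} = 0 \\
1/2 = 0.5000 & \text{if } 1 \in s \text{ and } y_{sT} = 1 \\
7/10 = 0.7000 & \text{if } 1 \in s \text{ and } y_{sT} = 2 \\
7/24 = 0.2917 & \text{if } 1 \notin s \text{ and } y_{sT} = 0 \\
19/40 = 0.4750 & \text{if } 1 \notin s \text{ and } y_{sT} = 1 \\
127/184 = 0.6902 & \text{if } 1 \notin s \text{ and } y_{sT} = 2,
\end{cases}$$
or equivalently, by
$$\hat{\theta} = \left( \frac{3 + 2 y_{sT}}{10} \right) I_1 + \left( \frac{805 + 462 y_{sT} + 44 y_{sT}^2}{2760} \right) (1 - I_1).$$


Note: Here:
   $1 \in s \Leftrightarrow I_1 = 1 \Leftrightarrow s = (1,2)$, (1,3) or (1,4)
   $1 \notin s \Leftrightarrow I_1 = 0 \Leftrightarrow s = (2,3)$, (2,4) or (3,4).


Also:
   $y_{sT} = 0$ iff both sampled values are 0
   $y_{sT} = 1$ iff one sampled value is 0 and the other is 1
   $y_{sT} = 2$ iff both sampled values are 1.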
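
The six posterior means above are easy to verify numerically. The R sketch below is illustrative only: the function name `thetahat` and its arguments are assumptions introduced here, and the function simply normalises $f(\theta) f(y_s \mid \theta)$, multiplied by the extra factor $(3 - \theta)$ when unit 1 is not sampled.

```r
# Posterior mean of theta in Exercise A.4 (illustrative function and argument names)
thetahat <- function(ysT, unit1_sampled) {
  theta <- c(1/4, 3/4)                               # the two possible values of theta
  prior <- c(1/2, 1/2)
  lik   <- theta^ysT * (1 - theta)^(2 - ysT)         # f(y_s | theta)
  q     <- if (unit1_sampled) 1 else (3 - theta)     # nonignorable factor if 1 is not in s
  post  <- prior * lik * q
  post  <- post / sum(post)                          # f(theta | D)
  sum(theta * post)
}

ysT   <- 0:2
in_s  <- sapply(ysT, thetahat, unit1_sampled = TRUE)
out_s <- sapply(ysT, thetahat, unit1_sampled = FALSE)
rbind(ysT, in_s, out_s)

# Closed forms quoted in the text, for comparison
rbind((3 + 2 * ysT) / 10, (805 + 462 * ysT + 44 * ysT^2) / 2760)
```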
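(d)(i) If $\theta = 1/4$ and $s = (1,3)$ then $1 \in s$ and $I_1 = 1$, and so
$$\hat{\theta} = \left( \frac{3 + 2 y_{sT}}{10} \right) I_1 + \left( \frac{805 + 462 y_{sT} + 44 y_{sT}^2}{2760} \right)(1 - I_1) = \frac{3 + 2 y_{sT}}{10}.$$


So the model mean of $\hat{\theta}$ is
$$E(\hat{\theta} \mid \theta, s) = \frac{1}{10}\left\{ 3 + 2 E(y_{sT} \mid \theta, s) \right\}.$$


Now,
$$f(y \mid \theta, s) = \frac{f(y, s \mid \theta)}{f(s \mid \theta)},$$
where:
$$f(y, s \mid \theta) = f(s \mid y, \theta) f(y \mid \theta) = \frac{3 + y_1}{18} \prod_{i=1}^{4} \theta^{y_i}(1-\theta)^{1-y_i}$$
(using the result in (b) that $f(s \mid y, \theta) = \frac{3 + y_1}{18}$ if $1 \in s$)
$$f(s \mid \theta) = \sum_y f(y, s \mid \theta) = \sum_y f(s \mid y, \theta) f(y \mid \theta) = E\{ f(s \mid y, \theta) \mid \theta \} = E\left( \frac{3 + y_1}{18} \,\Big|\, \theta \right) = \frac{3 + \theta}{18}.$$
Therefore
$$f(y \mid \theta, s) = \frac{\frac{3 + y_1}{18} \prod_{i=1}^{4} \theta^{y_i}(1-\theta)^{1-y_i}}{\frac{3 + \theta}{18}}
= \left\{ \frac{3 + y_1}{3 + \theta} \theta^{y_1}(1-\theta)^{1-y_1} \right\} \prod_{i=2}^{4} \theta^{y_i}(1-\theta)^{1-y_i}.$$
We see that
$$(y_i \mid \theta, s) \sim \perp \text{Bernoulli}(\pi_i), \quad i = 1, 2, 3, 4,$$
where:
$$\pi_2 = \pi_3 = \pi_4 = \theta$$
$$\pi_1 = \frac{3 + 1}{3 + \theta} \theta^1 (1-\theta)^0 = \frac{4\theta}{3 + \theta}.$$
Check: $\dfrac{3 + 0}{3 + \theta} \theta^0 (1-\theta)^1 = \dfrac{3(1-\theta)}{3 + \theta} = 1 - \dfrac{4\theta}{3 + \theta} = 1 - \pi_1$.

It follows that
$$E(y_{sT} \mid \theta, s) = E(y_1 \mid \theta, s) + E(y_3 \mid \theta, s) = \pi_1 + \pi_3 = \frac{4\theta}{3 + \theta} + \theta = \frac{\theta(7 + \theta)}{3 + \theta} = \frac{\frac{1}{4}\left(7 + \frac{1}{4}\right)}{3 + \frac{1}{4}} = \frac{29/16}{13/4} = \frac{29}{52}.$$


Hence
$$E(\hat{\theta} \mid \theta, s) = \frac{1}{10}\left\{ 3 + 2\left( \frac{29}{52} \right) \right\} = \frac{107}{260} = 0.4115.$$


So, if $\theta = 1/4$ and $s = (1,3)$, then the model bias of $\hat{\theta}$ is
$$E(\hat{\theta} - \theta \mid \theta, s) = \frac{107}{260} - \frac{1}{4} = \frac{107 - 65}{260} = \frac{21}{130} = 0.1615.$$


Note: We can also report the relative model bias of $\hat{\theta}$ as
$$E\left( \frac{\hat{\theta} - \theta}{\theta} \,\Big|\, \theta, s \right) = \frac{21/130}{1/4} = \frac{42}{65} = +64.6\%.$$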
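(d)(ii) If $\theta = 1/4$ and $s = (2,3)$ then $1 \in r$ and $I_1 = 0$, and so
$$\hat{\theta} = \left( \frac{3 + 2 y_{sT}}{10} \right) I_1 + \left( \frac{805 + 462 y_{sT} + 44 y_{sT}^2}{2760} \right)(1 - I_1) = \frac{805 + 462 y_{sT} + 44 y_{sT}^2}{2760}.$$


So the model mean of $\hat{\theta}$ is
$$E(\hat{\theta} \mid \theta, s) = \frac{805 + 462 E(y_{sT} \mid \theta, s) + 44 E(y_{sT}^2 \mid \theta, s)}{2760}.$$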
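In this case,
$$f(y \mid \theta, s) = \frac{f(y, s \mid \theta)}{f(s \mid \theta)}$$
as before, but with
$$f(y, s \mid \theta) = f(s \mid y, \theta) f(y \mid \theta) = \frac{3 - y_1}{18} \prod_{i=1}^{4} \theta^{y_i}(1-\theta)^{1-y_i}$$
(using the result in (b) that $f(s \mid y, \theta) = \frac{3 - y_1}{18}$ if $1 \notin s$).
Thus,
$$f(s \mid \theta) = \sum_y f(y, s \mid \theta) = \sum_y f(s \mid y, \theta) f(y \mid \theta) = E\{ f(s \mid y, \theta) \mid \theta \} = E\left( \frac{3 - y_1}{18} \,\Big|\, \theta \right) = \frac{3 - \theta}{18}.$$
So
$$f(y \mid \theta, s) = \frac{\frac{3 - y_1}{18} \prod_{i=1}^{4} \theta^{y_i}(1-\theta)^{1-y_i}}{\frac{3 - \theta}{18}}
= \left\{ \frac{3 - y_1}{3 - \theta} \theta^{y_1}(1-\theta)^{1-y_1} \right\} \prod_{i=2}^{4} \theta^{y_i}(1-\theta)^{1-y_i}.$$
We see that
$$(y_i \mid \theta, s) \sim \perp \text{Bernoulli}(\pi_i), \quad i = 1, 2, 3, 4,$$
where:
$$\pi_2 = \pi_3 = \pi_4 = \theta$$
$$\pi_1 = \frac{3 - 1}{3 - \theta} \theta^1 (1-\theta)^0 = \frac{2\theta}{3 - \theta}.$$
Check: $\dfrac{3 - 0}{3 - \theta} \theta^0 (1-\theta)^1 = \dfrac{3(1-\theta)}{3 - \theta} = 1 - \dfrac{2\theta}{3 - \theta} = 1 - \pi_1$.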
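It follows that
$$E(y_{sT} \mid \theta, s) = E(y_2 \mid \theta, s) + E(y_3 \mid \theta, s) = \theta + \theta = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}.$$
Equivalently,
$$(y_{sT} \mid \theta, s) \sim \text{Bin}(2, \theta),$$
and so
$$E(y_{sT} \mid \theta, s) = 2\theta.$$


By the same token,
$$E(y_{sT}^2 \mid \theta, s) = V(y_{sT} \mid \theta, s) + \{ E(y_{sT} \mid \theta, s) \}^2 = 2\theta(1-\theta) + (2\theta)^2 = 2 \times \frac{1}{4}\left( 1 - \frac{1}{4} \right) + \left( \frac{1}{2} \right)^2 = \frac{5}{8}.$$


Hence
$$E(\hat{\theta} \mid \theta, s) = \frac{805 + 462\left( \frac{1}{2} \right) + 44\left( \frac{5}{8} \right)}{2760} = \frac{2127}{5520} = 0.3853.$$


So, if $\theta = 1/4$ and $s = (2,3)$, then the model bias of $\hat{\theta}$ is
$$E(\hat{\theta} - \theta \mid \theta, s) = \frac{2127}{5520} - \frac{1}{4} = \frac{747}{5520} = 0.1353.$$


Note: As regards the model bias of $\hat{\theta}$, there are a total of 4 cases,
corresponding to whether $1 \in s$ or $1 \notin s$, and to whether $\theta = 1/4$ or
$\theta = 3/4$. We have covered two of these four cases.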
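
The two model means in (d) can be checked by brute force, enumerating all $2^4$ population vectors y and weighting each by $f(y, s \mid \theta) = f(s \mid y) f(y \mid \theta)$. The R sketch below is an illustration under that approach; `thetahat` and `model_mean` are hypothetical helper names introduced here, not functions from the solution.

```r
# Brute-force check of the model mean E(thetahat | theta, s); helper names are illustrative
thetahat <- function(ysT, unit1_sampled) {
  theta <- c(1/4, 3/4)
  post  <- 0.5 * theta^ysT * (1 - theta)^(2 - ysT) *
           (if (unit1_sampled) 1 else (3 - theta))
  sum(theta * post) / sum(post)                         # posterior mean of theta
}

model_mean <- function(theta, s) {
  ys_all <- as.matrix(expand.grid(rep(list(0:1), 4)))   # all 16 population vectors
  in_s   <- 1 %in% s
  total  <- 0; norm <- 0
  for (k in seq_len(nrow(ys_all))) {
    y  <- ys_all[k, ]
    fy <- prod(theta^y * (1 - theta)^(1 - y))           # f(y | theta)
    fs <- (1 + in_s * y[1]) / (6 + 3 * y[1])            # f(s | y) = f(s | y_1)
    w  <- fy * fs                                       # f(y, s | theta)
    total <- total + w * thetahat(sum(y[s]), in_s)
    norm  <- norm + w                                   # accumulates f(s | theta)
  }
  total / norm
}

mm13 <- model_mean(1/4, c(1, 3)); c(mm13, 107/260)      # case (d)(i)
mm23 <- model_mean(1/4, c(2, 3)); c(mm23, 2127/5520)    # case (d)(ii)
c(mm13, mm23) - 1/4                                     # the two model biases
```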


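(e)(i) If $\theta = 1/4$ and $y = (0,0,1,1)$ then $y_1 = 0$. So in that particular case
the sampling mechanism is definitely SRSWOR and ignorable. Without
further thought, the posterior density of $\theta$ can be obtained as follows:
$$\begin{aligned}
f(\theta \mid D) = f(\theta \mid s, y_s) &= f(\theta \mid y_s) \\
&\propto f(\theta) f(y_s \mid \theta) \\
&\propto 1 \times \prod_{i \in s} \theta^{y_i}(1-\theta)^{1-y_i}.
\end{aligned}$$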
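Recalling (c), note that
$$f(\theta \mid D) = \begin{cases}
9/10 \ (\theta = 1/4), \ \ 1/10 \ (\theta = 3/4), & y_{sT} = 0 \\
1/2 \ (\theta = 1/4), \ \ 1/2 \ (\theta = 3/4), & y_{sT} = 1 \\
1/10 \ (\theta = 1/4), \ \ 9/10 \ (\theta = 3/4), & y_{sT} = 2
\end{cases}$$
and
$$\hat{\theta} = \begin{cases} 3/10, & y_{sT} = 0 \\ 1/2, & y_{sT} = 1 \\ 7/10, & y_{sT} = 2 \end{cases} = \frac{3 + 2 y_{sT}}{10}.$$


The design mean of $\hat{\theta}$ is therefore
$$E(\hat{\theta} \mid \theta, y) = \frac{3 + 2 E(y_{sT} \mid \theta, y)}{10},$$
where
$$E(y_{sT} \mid \theta, y) = n E(\bar{y}_s \mid \theta, y) = n \sum_s \bar{y}_s f(s \mid \theta, y) = n \bar{y},$$
since (making use of basic results in the classical theory)
$$f(s \mid \theta, y) = f(s) = \binom{N}{n}^{-1} = \frac{1}{6} \text{ for all } s.$$


Therefore, with $n \bar{y} = 2 \times \frac{1}{2} = 1$, the design mean of $\hat{\theta}$ is
$$E(\hat{\theta} \mid \theta, y) = \frac{3 + 2 \times 1}{10} = \frac{1}{2}.$$


So the design bias of $\hat{\theta}$ is
$$E(\hat{\theta} - \theta \mid \theta, y) = \frac{1}{2} - \theta = \frac{1}{2} - \frac{1}{4} = \frac{1}{4}.$$


Note: In the above, $E(\hat{\theta} \mid \theta, y)$ does not depend on $\theta$. So, for the case
$\theta = 3/4$ and $y = (0,0,1,1)$, the design bias of $\hat{\theta}$ is $\frac{1}{2} - \frac{3}{4} = -0.25$.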
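(e)(ii) If $\theta = 1/4$ and $y = (1,0,1,1)$ then $y_1 = 1$, and so the sampling
mechanism is potentially nonignorable (depending on which sample s
happens to be drawn).


Recall from (c) that the posterior mean of $\theta$ is a function of the data given
generally by
$$\hat{\theta} = \hat{\theta}(s, y_s) = \begin{cases}
3/10 = 0.3000 & \text{if } 1 \in s \text{ and } y_{sT} = 0 \\
1/2 = 0.5000 & \text{if } 1 \in s \text{ and } y_{sT} = 1 \\
7/10 = 0.7000 & \text{if } 1 \in s \text{ and } y_{sT} = 2 \\
7/24 = 0.2917 & \text{if } 1 \notin s \text{ and } y_{sT} = 0 \\
19/40 = 0.4750 & \text{if } 1 \notin s \text{ and } y_{sT} = 1 \\
127/184 = 0.6902 & \text{if } 1 \notin s \text{ and } y_{sT} = 2.
\end{cases}$$


Also recall from (b) that
$$f(s \mid \theta, y) = \begin{cases} \dfrac{3 + y_1}{18}, & s = (1,2), (1,3), (1,4) \\[2mm] \dfrac{3 - y_1}{18}, & s = (2,3), (2,4), (3,4). \end{cases}$$


The design mean of $\hat{\theta}$ can now be worked out according to
$$E(\hat{\theta} \mid \theta, y) = \sum_s \hat{\theta}(s, y_s) f(s \mid \theta, y).$$
Now, suppose that we draw the sample $s = (1,2)$.


Then $y_s = (y_1, y_2) = (1,0)$.


Thus $1 \in s$ and $y_{sT} = 1$, and so by the above,
$$\hat{\theta}(s, y_s) f(s \mid \theta, y) = \frac{1}{2} \times \frac{3 + 1}{18} = \frac{1}{9}.$$
Likewise:
If $s = (1,3)$ then $y_s = (y_1, y_3) = (1,1)$ and so
$$\hat{\theta}(s, y_s) f(s \mid \theta, y) = \frac{7}{10} \times \frac{3 + 1}{18} = \frac{7}{45}.$$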
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If s - (,4) then y, 2 (y, y) = (1,1) and so 
A 341. 7 
Os, y.) f (s19, a Eos 


18 45. 

If s 2 (2,3) then y, 2 (yj, y;) = (0,1) and so 
ôs yfl y) = x= um 

If s 2 (2,4) then y, =(y,,y,) = (0,1) and so 
G(s, y fley = x=. 


If s 2 (3,4) then y, 2 (y, y,) = (1,1) and so 
127 3-1 127 
(sy. f (s10, y) = 


18 1656 
It follows that
$$E(\hat\theta \mid \theta, y) = \sum_s \hat\theta(s, y_s) f(s \mid \theta, y)$$
= (1/9) + (7/45) + (7/45) + (19/360) + (19/360) + (127/1656)
= 0.6045.

Thus, if $\theta = 1/4$ and $y = (1,0,1,1)$, then the design bias of $\hat\theta$ is

$$E(\hat\theta - \theta \mid \theta, y) = 0.6045 - \frac{1}{4} = 0.3545.$$

Note 1: Also, if $\theta = 3/4$ and $y = (1,0,1,1)$, then the design bias of $\hat\theta$ is
$0.6045 - 3/4 = -0.1455$.
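
The design mean in (e)(ii) can be cross-checked by brute-force enumeration. The following is a minimal R sketch, separate from the R code listed at the end of this exercise. It assumes the model as used in this solution: $\theta$ is 1/4 or 3/4 with prior probability 1/2 each, the $y_i$ are iid Bernoulli($\theta$) given $\theta$, and $f(s \mid \theta, y)$ equals $(3+y_1)/18$ for samples containing unit 1 and $(3-y_1)/18$ otherwise. The names `thetavals`, `samples`, `ygrid`, `fs`, `post_mean` and `design_mean` are illustrative only.

```r
thetavals <- c(1/4, 3/4)                 # support of theta (equal prior weights)
samples <- t(combn(4, 2))                # rows: (1,2),(1,3),(1,4),(2,3),(2,4),(3,4)
ygrid <- unname(as.matrix(expand.grid(rep(list(0:1), 4))))  # all 16 vectors y

# Sampling density f(s | y), as assumed in the lead-in
fs <- function(s, y) if (1 %in% s) (3 + y[1]) / 18 else (3 - y[1]) / 18

# Posterior mean of theta given the sample s and the sample values ys,
# obtained by summing the joint density over the unobserved values
post_mean <- function(s, ys) {
  keep <- apply(ygrid, 1, function(y) all(y[s] == ys))
  w <- sapply(thetavals, function(th) {
    sum(apply(ygrid[keep, , drop = FALSE], 1, function(y)
      prod(dbinom(y, 1, th)) * fs(s, y)))
  })
  sum(thetavals * w) / sum(w)
}

# Design mean: average the posterior mean over the six possible samples
design_mean <- function(y) {
  sum(apply(samples, 1, function(s) post_mean(s, y[s]) * fs(s, y)))
}

design_mean(c(1, 0, 1, 1))               # compare with 0.6045 above
design_mean(c(1, 0, 1, 1)) - 1/4         # design bias when theta = 1/4
```

Because `post_mean` is computed from the joint density rather than copied from the table in (c), agreement with 3/10, ..., 127/184 is also a check on that table.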


Note 2: As regards the design bias of $\hat\theta$, there are a total of
2 x 4 x 2 = 16 cases to be considered, corresponding to

$y_1$ being either 0 or 1 (2 possibilities)
$y_T - y_1$ being 0 or 1 or 2 or 3 (4 possibilities)
$\theta$ being either 1/4 or 3/4 (2 possibilities).

We have covered four of these 16 cases.

(f) Recall from (c) that
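$$(y_i \mid \theta, s) \sim \perp \text{Bernoulli}(\pi_i), \quad i = 1,2,3,4,$$

where:
$$\pi_2 = \pi_3 = \pi_4 = \theta, \qquad
\pi_1 = \begin{cases} \dfrac{4\theta}{3+\theta}, & 1 \in s \\[2mm] \dfrac{2\theta}{3-\theta}, & 1 \notin s. \end{cases}$$

Therefore
$$E(y_{rT} \mid \theta, s, y_s) = E(y_{rT} \mid \theta, s) = \begin{cases} \theta + \theta, & 1 \in s \\ \theta + \phi, & 1 \notin s, \end{cases}$$
where
$$\phi = \frac{2\theta}{3-\theta}.$$

So
$$E(y_{rT} \mid s, y_s) = E\{E(y_{rT} \mid \theta, s, y_s) \mid s, y_s\}
= \begin{cases} E(2\theta \mid D), & 1 \in s \\ E(\theta + \phi \mid D), & 1 \notin s \end{cases}
= \begin{cases} 2\hat\theta, & 1 \in s \\ \hat\theta + \hat\phi, & 1 \notin s, \end{cases}$$

where
$$\hat\phi = E(\phi \mid D) = E\left(\frac{2\theta}{3-\theta} \,\Big|\, D\right) = \sum_{\theta = 1/4,\, 3/4} \left(\frac{2\theta}{3-\theta}\right) f(\theta \mid D).$$

The finite population mean is

$$\bar y = \frac{1}{4}(y_{sT} + y_{rT}),$$

and so the predictive mean of $\bar y$ may be expressed as

$$\hat{\bar y} = E(\bar y \mid s, y_s) = \frac{1}{4}\{y_{sT} + E(y_{rT} \mid s, y_s)\}.$$

Using suitable R functions, we find that $\hat\phi$ and $\hat{\bar y}$ are as follows:

If $1 \in s$ and $y_{sT} = 0$ then $\hat\phi = 0.2303030$ and $\hat{\bar y} = 0.1500000$
If $1 \in s$ and $y_{sT} = 1$ then $\hat\phi = 0.4242424$ and $\hat{\bar y} = 0.5000000$
If $1 \in s$ and $y_{sT} = 2$ then $\hat\phi = 0.6181818$ and $\hat{\bar y} = 0.8500000$

If $1 \notin s$ and $y_{sT} = 0$ then $\hat\phi = 0.2222222$ and $\hat{\bar y} = 0.1284722$
If $1 \notin s$ and $y_{sT} = 1$ then $\hat\phi = 0.4000000$ and $\hat{\bar y} = 0.4687500$
If $1 \notin s$ and $y_{sT} = 2$ then $\hat\phi = 0.6086957$ and $\hat{\bar y} = 0.8247283$.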


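Note: Working through the above equation using exact fractions, it can
be shown that

$$\hat{\bar y} = \hat{\bar y}(s, y_s) = \begin{cases}
3/20, & 1 \in s,\ y_{sT} = 0 \\
1/2, & 1 \in s,\ y_{sT} = 1 \\
17/20, & 1 \in s,\ y_{sT} = 2 \\
37/288, & 1 \notin s,\ y_{sT} = 0 \\
15/32, & 1 \notin s,\ y_{sT} = 1 \\
607/736, & 1 \notin s,\ y_{sT} = 2.
\end{cases}$$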


The following are details of the working for 37/288, 15/32 and 607/736. 


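Observe that

$$E(y_{rT} \mid \theta, s, y_s) = \theta + \frac{2\theta}{3-\theta} = \frac{\theta(5-\theta)}{3-\theta}.$$

Therefore
$$\hat y_{rT} = E(y_{rT} \mid s, y_s) = E\left\{\frac{\theta(5-\theta)}{3-\theta} \,\Big|\, s, y_s\right\}
= \sum_{\theta = 1/4,\, 3/4} \frac{\theta(5-\theta)}{3-\theta} f(\theta \mid s, y_s),$$

where $\theta(5-\theta)/(3-\theta)$ equals 19/44 at $\theta = 1/4$ and 17/12 at $\theta = 3/4$.

So, if $y_{sT} = 0$ then
$$\hat y_{rT} = \frac{19}{44} \times \frac{11}{12} + \frac{17}{12} \times \frac{1}{12} = \frac{19}{48} + \frac{17}{144} = \frac{74}{144} = \frac{37}{72}.$$

If $y_{sT} = 1$ then
$$\hat y_{rT} = \frac{19}{44} \times \frac{11}{20} + \frac{17}{12} \times \frac{9}{20} = \frac{19}{80} + \frac{51}{80} = \frac{7}{8}.$$

And if $y_{sT} = 2$ then
$$\hat y_{rT} = \frac{19}{44} \times \frac{11}{92} + \frac{17}{12} \times \frac{81}{92} = \frac{19}{368} + \frac{459}{368} = \frac{239}{184}.$$

Thus (for $1 \notin s$) we have that
$$\hat y_{rT} = \begin{cases} 37/72, & y_{sT} = 0 \\ 7/8, & y_{sT} = 1 \\ 239/184, & y_{sT} = 2. \end{cases}$$

Hence
$$\hat y_T = y_{sT} + \hat y_{rT} = \begin{cases} 0 + 37/72 = 37/72, & y_{sT} = 0 \\ 1 + 7/8 = 15/8, & y_{sT} = 1 \\ 2 + 239/184 = 607/184, & y_{sT} = 2. \end{cases}$$

Thus, finally (for $1 \notin s$), we obtain
$$\hat{\bar y} = \frac{\hat y_T}{4} = \begin{cases} 37/288, & y_{sT} = 0 \\ 15/32, & y_{sT} = 1 \\ 607/736, & y_{sT} = 2. \end{cases}$$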


A similar logic can be used to obtain the fractions 3/20, 1/2 and 17/20. 
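
As a brief sketch of that similar logic (using only the posterior means 3/10, 1/2 and 7/10 from (c)): when $1 \in s$, both nonsampled units lie in $\{2,3,4\}$, so each has conditional mean $\theta$ and $E(y_{rT} \mid s, y_s) = 2\hat\theta$. The predictive mean is then $(y_{sT} + 2\hat\theta)/4$, which gives:

$$\begin{aligned}
y_{sT} = 0: \quad & \hat{\bar y} = \tfrac{1}{4}\left(0 + 2 \times \tfrac{3}{10}\right) = \tfrac{3}{20}, \\
y_{sT} = 1: \quad & \hat{\bar y} = \tfrac{1}{4}\left(1 + 2 \times \tfrac{1}{2}\right) = \tfrac{1}{2}, \\
y_{sT} = 2: \quad & \hat{\bar y} = \tfrac{1}{4}\left(2 + 2 \times \tfrac{7}{10}\right) = \tfrac{17}{20}.
\end{aligned}$$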


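(g)(i) Suppose that $\theta = 1/4$ and $s = (1,3)$. Then $1 \in s$ and so
$$(y_1, \ldots, y_4 \mid \theta, s) \sim \perp \text{Bernoulli}(\pi_i),$$
where:
$$\pi_1 = \frac{4\theta}{3+\theta} = \frac{4}{13}, \qquad \pi_2 = \pi_3 = \pi_4 = \theta = \frac{1}{4}.$$

In this case,
$$y_{sT} = y_1 + y_3,$$
and so:
$$P(y_{sT} = 0 \mid \theta, s) = \frac{9}{13} \times \frac{3}{4} = \frac{27}{52}$$
$$P(y_{sT} = 2 \mid \theta, s) = \frac{4}{13} \times \frac{1}{4} = \frac{4}{52}$$
$$P(y_{sT} = 1 \mid \theta, s) = 1 - \frac{27}{52} - \frac{4}{52} = \frac{21}{52}.$$

So the model mean of $\hat{\bar y}$ is
$$E(\hat{\bar y} \mid \theta, s) = E\{E(\hat{\bar y} \mid \theta, s, y_s) \mid \theta, s\} = E\{\hat{\bar y}(s, y_{sT}) \mid \theta, s\} = \sum_{y_{sT}=0}^{2} \hat{\bar y}(s, y_{sT}) f(y_{sT} \mid \theta, s)$$
= 0.15(27/52) + 0.5(21/52) + 0.85(4/52)
= 0.3451923.

Also, the model mean of $\bar y$ is
$$E(\bar y \mid \theta, s) = \frac{1}{4}(\pi_1 + \ldots + \pi_4) = \frac{1}{4}\left(\frac{4}{13} + \frac{1}{4} + \frac{1}{4} + \frac{1}{4}\right)$$
= 55/208 = 0.2644231.

So the model bias of $\hat{\bar y}$ is
$$E(\hat{\bar y} - \bar y \mid \theta, s) = 0.3451923 - 0.2644231 = 0.08077.$$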
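
The arithmetic in (g)(i) is short enough to reproduce directly. The sketch below is illustrative only: it assumes the inclusion-adjusted probability $\pi_1 = 4\theta/(3+\theta)$ for the case $1 \in s$ and the predictive means 3/20, 1/2 and 17/20 from (f), and the variable names are not taken from the code listing for this exercise.

```r
theta <- 1/4
pi1 <- 4 * theta / (3 + theta)         # P(y1 = 1 | theta, s) when unit 1 is sampled
pis <- c(pi1, theta, theta, theta)     # (pi_1, pi_2, pi_3, pi_4)

# Distribution of ysT = y1 + y3 for the sample s = (1,3)
p0 <- (1 - pis[1]) * (1 - pis[3])      # both sampled values are 0
p2 <- pis[1] * pis[3]                  # both sampled values are 1
p1 <- 1 - p0 - p2

ybarhat <- c(3/20, 1/2, 17/20)         # predictive means for ysT = 0, 1, 2 (1 in s)
model_mean_ybarhat <- sum(ybarhat * c(p0, p1, p2))
model_mean_ybar <- mean(pis)           # (pi_1 + ... + pi_4) / 4
model_bias <- model_mean_ybarhat - model_mean_ybar

c(model_mean_ybarhat, model_mean_ybar, model_bias)
```

Changing `pi1` to `2 * theta / (3 - theta)`, the sampled pair to units 2 and 3, and `ybarhat` to `c(37/288, 15/32, 607/736)` gives the corresponding calculation for a sample that excludes unit 1.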


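(g)(ii) Suppose that $\theta = 1/4$ and $s = (2,3)$. Then $1 \notin s$ and so
$$(y_1, \ldots, y_4 \mid \theta, s) \sim \perp \text{Bernoulli}(\pi_i),$$
where:
$$\pi_1 = \frac{2\theta}{3-\theta} = \frac{2}{11}, \qquad \pi_2 = \pi_3 = \pi_4 = \theta = \frac{1}{4}.$$

In this case,
$$y_{sT} = y_2 + y_3,$$
and so:
$$P(y_{sT} = 0 \mid \theta, s) = \frac{3}{4} \times \frac{3}{4} = \frac{9}{16}$$
$$P(y_{sT} = 2 \mid \theta, s) = \frac{1}{4} \times \frac{1}{4} = \frac{1}{16}$$
$$P(y_{sT} = 1 \mid \theta, s) = 1 - \frac{9}{16} - \frac{1}{16} = \frac{6}{16}.$$

So (using results in (g)(i)) the model mean of $\hat{\bar y}$ is
0.1284722(9/16) + 0.46875(6/16) + 0.8247283(1/16) = 0.2995924.

Also, the model mean of $\bar y$ is
$$E(\bar y \mid \theta, s) = \frac{1}{4}(\pi_1 + \ldots + \pi_4) = \frac{1}{4}\left(\frac{2}{11} + \frac{1}{4} + \frac{1}{4} + \frac{1}{4}\right)$$
= 41/176 = 0.2329545.

So the model bias of $\hat{\bar y}$ is
$$E(\hat{\bar y} - \bar y \mid \theta, s) = 0.2995924 - 0.2329545 = 0.06664.$$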


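(h)(i) Suppose that $\theta = 1/4$ and $y = (0,0,1,1)$. Then $y_1 = 0$ and so the
sampling mechanism is definitely SRSWOR and ignorable.

Explicitly, we have that
$$f(s \mid \theta, y) = f(s) = 1/6.$$

So the design mean of $\hat{\bar y}$ is
$$E(\hat{\bar y} \mid \theta, y) = E\{E(\hat{\bar y} \mid \theta, y, s) \mid \theta, y\} = \sum_s \hat{\bar y}(s, y_s) f(s \mid \theta, y) = \frac{1}{6}\sum_s \hat{\bar y}(s, y_s)$$
$$= \frac{1}{6}\big[\hat{\bar y}((1,2),(0,0)) + \hat{\bar y}((1,3),(0,1)) + \hat{\bar y}((1,4),(0,1))$$
$$\qquad + \hat{\bar y}((2,3),(0,1)) + \hat{\bar y}((2,4),(0,1)) + \hat{\bar y}((3,4),(1,1))\big]$$
= (1/6)(0.15 + 0.5 + 0.5 + 0.46875 + 0.46875 + 0.8247283)
= 0.4853714.

Also, the design mean of $\bar y$ is
$$E(\bar y \mid \theta, y) = (0 + 0 + 1 + 1)/4 = 0.5.$$

So the design bias of $\hat{\bar y}$ is
$$E(\hat{\bar y} - \bar y \mid \theta, y) = 0.4853714 - 0.5 = -0.01463.$$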


Note: The derivation of this result did not involve $\theta$. So for the case
$\theta = 3/4$ and $y = (0,0,1,1)$, the design bias of $\hat{\bar y}$ is also -0.01463.

(h)(ii) Suppose that $\theta = 1/4$ and $y = (1,0,1,1)$. Then $y_1 = 1$ and so the
sampling mechanism is possibly nonignorable, with

$$f(s \mid \theta, y) = \begin{cases}
\dfrac{3 + y_1}{18} = \dfrac{3+1}{18} = \dfrac{2}{9}, & s = (1,2), (1,3), (1,4) \\[2mm]
\dfrac{3 - y_1}{18} = \dfrac{3-1}{18} = \dfrac{1}{9}, & s = (2,3), (2,4), (3,4).
\end{cases}$$

So the design mean of $\hat{\bar y}$ is

$$E(\hat{\bar y} \mid \theta, y) = E\{E(\hat{\bar y} \mid \theta, y, s) \mid \theta, y\} = \sum_s \hat{\bar y}(s, y_s) f(s \mid \theta, y)$$
$$= \big[\hat{\bar y}((1,2),(1,0)) + \hat{\bar y}((1,3),(1,1)) + \hat{\bar y}((1,4),(1,1))\big]\frac{2}{9}$$
$$\qquad + \big[\hat{\bar y}((2,3),(0,1)) + \hat{\bar y}((2,4),(0,1)) + \hat{\bar y}((3,4),(1,1))\big]\frac{1}{9}$$
= (2/9)(0.5 + 0.85 + 0.85) + (1/9)(0.46875 + 0.46875 + 0.8247283)
= 0.684692.

Also, the design mean of $\bar y$ is $E(\bar y \mid \theta, y) = (1 + 0 + 1 + 1)/4 = 0.75$.
So the design bias of $\hat{\bar y}$ is $E(\hat{\bar y} - \bar y \mid \theta, y) = 0.684692 - 0.75 = -0.06531$.

Note: The derivation of this result did not involve $\theta$. So for the case
$\theta = 3/4$ and $y = (1,0,1,1)$, the design bias of $\hat{\bar y}$ is also -0.06531.


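(i) A suitable Gibbs sampler is based on the joint density

$$f(s, y, \theta) = f(\theta) f(y \mid \theta) f(s \mid y, \theta) \propto 1 \times \prod_{i=1}^{4} \theta^{y_i}(1-\theta)^{1-y_i} \times \frac{1 + y_1 I(1 \in s)}{6 + 3y_1}.$$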


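We can identify three conditional distributions here. First observe that

$$f(\theta \mid s, y) \propto \prod_{i=1}^{4} \theta^{y_i}(1-\theta)^{1-y_i} = \theta^{y_T}(1-\theta)^{4-y_T}, \quad \theta = 1/4, 3/4$$

$$= \begin{cases} (1/4)^{y_T}(1-1/4)^{4-y_T}, & \theta = 1/4 \\ (3/4)^{y_T}(1-3/4)^{4-y_T}, & \theta = 3/4. \end{cases} \qquad \text{(A.1)}$$

Next, recall from (d)(ii) that
$$(y_i \mid \theta, s) \sim \perp \text{Bernoulli}(\pi_i), \quad i = 1,2,3,4,$$
where:
$$\pi_2 = \pi_3 = \pi_4 = \theta, \qquad
\pi_1 = \begin{cases} \dfrac{4\theta}{3+\theta}, & 1 \in s \\[2mm] \dfrac{2\theta}{3-\theta}, & 1 \notin s. \end{cases}$$

Now, the second component of $r = (r_1, r_2)$ must be 2, 3 or 4.

Therefore
$$(y_{r_2} \mid \theta, s, y_s, y_{r_1}) \sim \text{Bernoulli}(\theta). \qquad \text{(A.2)}$$

However, there are two possibilities for $y_{r_1}$. If the data is such that $s_1 = 1$
then
$$(y_{r_1} \mid \theta, s, y_s, y_{r_2}) \sim \text{Bernoulli}(\theta). \qquad \text{(A.3)}$$

On the other hand, if the data is such that $s_1 \ne 1$ then $r_1 = 1$, and this
implies that
$$(y_{r_1} \mid \theta, s, y_s, y_{r_2}) \sim \text{Bernoulli}\left(\frac{2\theta}{3-\theta}\right). \qquad \text{(A.4)}$$

Equations (A.1), (A.2), (A.3) and (A.4) imply three conditional
distributions which define a suitable Gibbs sampler (for $\theta$, $y_{r_1}$ and $y_{r_2}$).

Note: At (A.4), the ratio of probabilities of $y_{r_1} = 0$ to $y_{r_1} = 1$ is

$$\frac{1 - \dfrac{2\theta}{3-\theta}}{\dfrac{2\theta}{3-\theta}} = \frac{3 - 3\theta}{2\theta} = \frac{3}{2}\left(\frac{1-\theta}{\theta}\right),$$

which is exactly 3/2 times the ratio of the probabilities of $y_{r_1} = 0$ to
$y_{r_1} = 1$ at (A.3). (This observation provided some assistance when
formulating the required R code, as detailed below.)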


Implementing the above Gibbs sampler, we obtained a random sample
$(\theta^{(1)}, y^{(1)}), \ldots, (\theta^{(J)}, y^{(J)}) \sim \text{iid } f(\theta, y \mid D)$
for each of the six possible data configurations in (c) and (f).


The respective sample means for $\theta$ were:
0.3007, 0.4924, 0.6997, 0.2952, 0.4764, 0.6925.

It will be observed that these numbers are very close to the corresponding
values obtained in (c), namely

$$\hat\theta = \begin{cases}
3/10 = 0.3000 & \text{if } 1 \in s \text{ and } y_{sT} = 0 \\
1/2 = 0.5000 & \text{if } 1 \in s \text{ and } y_{sT} = 1 \\
7/10 = 0.7000 & \text{if } 1 \in s \text{ and } y_{sT} = 2 \\
7/24 = 0.2917 & \text{if } 1 \notin s \text{ and } y_{sT} = 0 \\
19/40 = 0.4750 & \text{if } 1 \notin s \text{ and } y_{sT} = 1 \\
127/184 = 0.6902 & \text{if } 1 \notin s \text{ and } y_{sT} = 2.
\end{cases}$$

The respective sample means for $\bar y$ were:
0.1518, 0.4929, 0.8485, 0.1308, 0.4719, 0.8269.

It will be noted that these are very close to the corresponding values
obtained in (f), namely:
0.15, 0.5, 0.85, 0.1284722, 0.4687500, 0.8247283.


(j) To check the design bias in (h)(i) we note that for $y = (0,0,1,1)$ the
sampling mechanism is ignorable.

So proceed as follows. Simply select one of the 6 possible samples
randomly. Then calculate the corresponding value of $\hat{\bar y}$. Repeat another
$J - 1$ times, independently. Then take the mean of the simulated $\hat{\bar y}$
values and subtract $\bar y = 2/4$.

Implementing this procedure with J = 10,000 yielded a point estimate of
-0.01562 with 95% CI (-0.01945, -0.01179). This is consistent with the
result -0.01463 in (h)(i).


To check the design bias in (h)(ii) we note that for $y = (1,0,1,1)$ the
sampling mechanism is nonignorable, with each sample containing unit 1
twice as likely as each sample not containing unit 1.

So, select a sample s from (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), in such a
way that each of the first three of these has probability 2/9 and each of the
last three has probability 1/9. Then calculate the corresponding value of
$\hat{\bar y}$. Repeat another $J - 1$ times, independently. Then take the mean of the
simulated $\hat{\bar y}$ values and subtract $\bar y = 3/4$.

Implementing this procedure with J = 10,000 yielded a point estimate of
-0.06592 with 95% CI (-0.06944, -0.06239). This is consistent with the
result -0.06531 in (h)(ii).


(k) The mean of the predictive mean of the finite population mean is the 
same as the unconditional mean of the finite population mean, which is 
the same as the prior mean of the superpopulation mean, which in our case 
equals 1/2. Mathematically, 


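$$E\hat{\bar y} = EE(\bar y \mid s, y_s) \quad \text{by the definition of } \hat{\bar y}$$

$$= E\bar y \quad \text{by the law of conditional expectation}$$

$$= EE(\bar y \mid \theta) \quad \text{by the law of conditional expectation}$$

$$= E\theta \quad \text{since } E(\bar y \mid \theta) = \frac{1}{4}\sum_{i=1}^{4} E(y_i \mid \theta) = \frac{1}{4}\sum_{i=1}^{4} \theta = \theta$$

$$= \sum_\theta \theta f(\theta) = \frac{1}{4} \times \frac{1}{2} + \frac{3}{4} \times \frac{1}{2} = \frac{1}{2}.$$

To verify this obvious result via Monte Carlo is a good final check on
previous calculations.

To this end, simulate $\theta$, then simulate y, then simulate s, hence obtain the
data $(s, y_s)$, then calculate the associated $\hat{\bar y}$. Then repeat all of the above
independently another $J - 1$ times.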


Implementing this procedure with J = 10,000 yielded a point estimate of 
0.4992 with 95% CI (0.4938, 0.5047). This is consistent with the answer 
of 1/2 above. 
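
The Monte Carlo check in (k) can be complemented by an exact enumeration over every combination of $\theta$, y and s. The sketch below is illustrative and is not the author's code; it assumes the same model as above (equal prior weights on 1/4 and 3/4, iid Bernoulli($\theta$) values, and sampling probabilities $(3 \pm y_1)/18$), and names such as `cells`, `pred` and `overall` are hypothetical.

```r
thetavals <- c(1/4, 3/4)
samples <- t(combn(4, 2))                # six possible samples, one per row
ygrid <- unname(as.matrix(expand.grid(rep(list(0:1), 4))))  # 16 population vectors
fs <- function(s, y) if (1 %in% s) (3 + y[1]) / 18 else (3 - y[1]) / 18

# One row per (theta, y, s) combination, with joint probability w
cells <- expand.grid(ti = 1:2, yi = 1:16, si = 1:6)
cells$w <- mapply(function(ti, yi, si) {
  y <- ygrid[yi, ]
  0.5 * prod(dbinom(y, 1, thetavals[ti])) * fs(samples[si, ], y)
}, cells$ti, cells$yi, cells$si)
cells$ybar <- rowMeans(ygrid)[cells$yi]

# Label each cell by its observed data: sample index plus sampled values
cells$data <- mapply(function(yi, si)
  paste(si, paste(ygrid[yi, samples[si, ]], collapse = "")),
  cells$yi, cells$si)

num <- tapply(cells$w * cells$ybar, cells$data, sum)
den <- tapply(cells$w, cells$data, sum)  # marginal probability of each data set
pred <- num / den                        # predictive mean E(ybar | s, ys)
overall <- sum(pred * den)               # mean of the predictive mean

pred[c("1 00", "4 11")]                  # compare with 3/20 and 607/736 in (f)
overall
```

The value of `overall` equals 1/2 by construction (this is the law of iterated expectation). The more informative check is that the entries of `pred` reproduce the predictive means reported in (f).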


R Code for Exercise A.4 


# (g)
postfun = function(s=c(1,2), ys=c(0,1)){ ysT=sum(ys)
  if(any(s==1)==T){ if(ysT==0) probs=c(0.9,0.1)
                    if(ysT==1) probs=c(0.5,0.5)
                    if(ysT==2) probs=c(0.1,0.9) }
  if(any(s==1)==F){ if(ysT==0) probs=c(11/12,1/12)
                    if(ysT==1) probs=c(11/20,9/20)
                    if(ysT==2) probs=c(11/92,81/92) }
  probs }

postfun() # 0.5 0.5 Just testing
postfun(s=c(2,4),ys=c(1,1)) # 0.1195652 0.8804348

thetahatfun=function(s=c(1,2), ys=c(0,1)){ probs= postfun(s=s,ys=ys);
  thetavals=c(1,3)/4; sum(thetavals * probs ) }
thetahatfun() # 0.5 Just testing
thetahatfun(s=c(2,4),ys=c(1,1)) # 0.6902174

phihatfun=function(s=c(1,2), ys=c(0,1)){ probs=postfun(s=s,ys=ys);
  thetavals=c(1,3)/4; phivals=2*thetavals/(3-thetavals)
  sum( phivals * probs) }

phihatfun() # 0.4242424 Just testing
phihatfun(s=c(2,4),ys=c(1,1)) # 0.6086957

yrThatfun=function(s=c(1,2), ys=c(0,1)){ thetahat=thetahatfun(s=s,ys=ys)
  if(any(s==1)==T){ res=2*thetahat }
  if(any(s==1)==F){
    phihat=phihatfun(s=s,ys=ys); res = thetahat + phihat }
  res }

yrThatfun() # 1 Just testing
yrThatfun(s=c(2,4),ys=c(1,1)) # 1.298913

ybarhatfun=function(s=c(1,2), ys=c(0,1)){ EyrT= yrThatfun(s=s,ys=ys)
  (sum(ys)+EyrT)/4 }

ybarhatfun() # 0.5 Just testing
ybarhatfun(s=c(2,4),ys=c(1,1)) # 0.8247283

smat=matrix(c(1,2, 1,2, 1,2, 1,2, 2,3, 2,3, 2,3, 2,3), byrow=T,nrow=8, ncol=2)
ysmat= matrix(c(0,0, 0,1, 1,0, 1,1, 0,0, 0,1, 1,0, 1,1),
  byrow=T,nrow=8, ncol=2)
thetahatvec=rep(NA,8); phihatvec=rep(NA,8); ybarhatvec=rep(NA,8);

for(k in 1:8){ thetahatvec[k]= thetahatfun(s=smat[k,],ys=ysmat[k,])
  phihatvec[k]= phihatfun(s=smat[k,],ys=ysmat[k,])
  ybarhatvec[k]= ybarhatfun(s=smat[k,],ys=ysmat[k,]) }

cbind(smat,NA,ysmat,NA,thetahatvec, NA, phihatvec, NA, ybarhatvec)
#                    thetahatvec    phihatvec    ybarhatvec
# [1,] 1 2 NA 0 0 NA   0.3000000 NA 0.2303030 NA  0.1500000
# [2,] 1 2 NA 0 1 NA   0.5000000 NA 0.4242424 NA  0.5000000
# [3,] 1 2 NA 1 0 NA   0.5000000 NA 0.4242424 NA  0.5000000 repeat OK
# [4,] 1 2 NA 1 1 NA   0.7000000 NA 0.6181818 NA  0.8500000
# [5,] 2 3 NA 0 0 NA   0.2916667 NA 0.2222222 NA  0.1284722
# [6,] 2 3 NA 0 1 NA   0.4750000 NA 0.4000000 NA  0.4687500
# [7,] 2 3 NA 1 0 NA   0.4750000 NA 0.4000000 NA  0.4687500 repeat OK
# [8,] 2 3 NA 1 1 NA   0.6902174 NA 0.6086957 NA  0.8247283


0.15*(27/52) + 0.5*(21/52) + 0.85*(4/52) # 0.3451923 
0.1284722*(9/16) + 0.46875*(6/16) + 0.8247283*(1/16) #0.2995924 


# (h) 
(1/6)*(0.15 + 0.5 + 0.5 + 0.46875+ 0.46875 + 0.8247283) #0.4853714 
(2/9)*(0.5 + 0.85 + 0.85) + (1/9)*(0.46875+ 0.46875 + 0.8247283) # 0.684692 


# (i) Check posterior means and predictive means via Gibbs sampler
options(digits=4)

GS=function(J=1000, s=c(1,2),ys=c(1,0), theta=1/4 ){
  thetav=rep(NA,J); yrTv=rep(NA,J); yTv=rep(NA,J)
  yrmat=matrix(NA,nrow=J,ncol=2); ysT=sum(ys)

  for(j in 1:J){
    probsyi = c(1-theta, theta)
    yr2=sample(x=c(0,1),size=1,prob=probsyi)
    # If unit 1 is nonsampled, its odds of being 0 are scaled by 3/2; see (A.4)
    if(s[1]==1) yr1=sample(x=c(0,1),size=1,prob=probsyi) else
      yr1=sample(x=c(0,1),size=1, prob=c(3,2)* probsyi)
    yr=c(yr1,yr2); yrT=sum(yr); yT=ysT+yrT
    probstheta=c( (1/4)^yT *(3/4)^(4-yT), (3/4)^yT *(1/4)^(4-yT) )
    theta = sample( x=c(1/4,3/4), size=1, prob= probstheta)
    thetav[j]=theta; yrTv[j]=yrT; yTv[j]=yT; yrmat[j,]=yr
  }
  list(thetav=thetav, yrTv=yrTv, yTv=yTv, ybarv=yTv/4, yrmat=yrmat) }

set.seed(111); J = 10000; thetahatvec=rep(NA,6); ybarhatvec=rep(NA,6)
res=GS(J=J,s=c(1,2),ys=c(0,0))
thetahatvec[1] = mean(res$thetav); ybarhatvec[1] = mean(res$ybarv);
res= GS(J=J,s=c(1,2),ys=c(0,1))
thetahatvec[2] = mean(res$thetav); ybarhatvec[2] = mean(res$ybarv);
res= GS(J=J,s=c(1,2),ys=c(1,1))
thetahatvec[3] = mean(res$thetav); ybarhatvec[3] = mean(res$ybarv);
res=GS(J=J,s=c(2,3),ys=c(0,0))
thetahatvec[4] = mean(res$thetav); ybarhatvec[4] = mean(res$ybarv);
res= GS(J=J,s=c(2,3),ys=c(0,1))
thetahatvec[5] = mean(res$thetav); ybarhatvec[5] = mean(res$ybarv);
res= GS(J=J,s=c(2,3),ys=c(1,1))
thetahatvec[6] = mean(res$thetav); ybarhatvec[6] = mean(res$ybarv);
thetahatvec # 0.3007 0.4924 0.6997 0.2952 0.4764 0.6925
# All very close to results in (c)
ybarhatvec # 0.1518 0.4929 0.8485 0.1308 0.4719 0.8269
# All very close to results in (f)


# (j) Check design bias of predictive mean of ybar if theta=1/4 and y=(0,0,1,1)
smatrix=matrix(c(1,2, 1,3, 1,4, 2,3, 2,4, 3,4), byrow=T,nrow=6, ncol=2)
y=c(0,0,1,1); J = 10000; ybarhatsimv=rep(NA,J); set.seed(413)

for(j in 1:J){ indexsim=sample(1:6,1,prob=c(1,1,1,1,1,1))
  ssim=smatrix[indexsim,]; yssim= y[ssim]
  ybarhatsimv[j] = ybarhatfun(s=ssim,ys=yssim) }

est=mean(ybarhatsimv)-0.5;
ci=est+c(-1,1)*qnorm(0.975)*sd(ybarhatsimv-0.5)/sqrt(J)
c(est,ci) # -0.01562 -0.01945 -0.01179 Consistent with -0.01463 in (h)(i)

# Check design bias of predictive mean of ybar if theta=1/4 and y=(1,0,1,1)
y=c(1,0,1,1); J = 10000; ybarhatsimv=rep(NA,J); set.seed(442)

for(j in 1:J){ indexsim=sample(1:6,1,prob=c(2,2,2,1,1,1))
  ssim=smatrix[indexsim,]; yssim= y[ssim]
  ybarhatsimv[j] = ybarhatfun(s=ssim,ys=yssim) }

est=mean(ybarhatsimv)-0.75;
ci=est+c(-1,1)*qnorm(0.975)*sd(ybarhatsimv-0.75)/sqrt(J)
c(est,ci) # -0.06592 -0.06944 -0.06239 Consistent with -0.06531 in (h)(ii)


# (k) Check mean of predictive mean of finite population mean
smatrix=matrix(c(1,2, 1,3, 1,4, 2,3, 2,4, 3,4), byrow=T,nrow=6, ncol=2)
J = 10000; ybarhatsimv=rep(NA,J); set.seed(102);

for(j in 1:J){
  thetasim=sample(c(1/4,3/4),1); ysim=rbinom(4,1,thetasim)
  if(ysim[1]==0) indexsim = sample(1:6,1,prob=c(1,1,1,1,1,1))
  if(ysim[1]==1) indexsim = sample(1:6,1,prob=c(2,2,2,1,1,1))
  ssim=smatrix[indexsim,]; yssim= ysim[ssim];
  ybarhatsimv[j]= ybarhatfun(s=ssim,ys=yssim) }


est = mean(ybarhatsimv);
ci = est+c(-1,1)*qnorm(0.975)*sd(ybarhatsimv)/sqrt(J)
c(est,ci) # 0.4992 0.4938 0.5047 Consistent with 0.5


Below are several probability distributions which feature in this book. The
purpose of this appendix is to provide a brief guide to the style of notation
and terminology used throughout. It is not intended to be a comprehensive
listing. Some of the notation introduced here is repeated in Appendix C.


B.1 The normal distribution


A random variable (rv) X has the normal distribution with parameters $\mu$
and $\sigma^2$ if its probability density function (pdf), or density, has the form

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}, \quad -\infty < x < \infty.$$

We then write $X \sim N(\mu, \sigma^2)$. To be more explicit, we will sometimes
write $f(x)$ as $f_X(x)$ or $f_{N(\mu,\sigma^2)}(x)$. To avoid subscripting notation and so
aid legibility, $f_{N(\mu,\sigma^2)}(x)$ may sometimes be written as $f(x, N(\mu,\sigma^2))$.

Likewise for other functions and expressions which contain subscripts.
If $X \sim N(\mu, \sigma^2)$ then $EX = \text{Mode}(X) = \text{Median}(X) = \mu$ and $VX = \sigma^2$.

The cumulative distribution function (cdf) of X is

$$F(x) = P(X \le x) = F_X(x) = F_{N(\mu,\sigma^2)}(x) = F(x, N(\mu,\sigma^2)) = \int_{-\infty}^{x} f_{N(\mu,\sigma^2)}(t)\,dt.$$

The (lower) p-quantile of X is the value of x such that $F(x) = p$.

Thus the p-quantile of X is the inverse cdf of X. This may also be written
$F^{-1}(p) = F_X^{-1}(p) = F_{N(\mu,\sigma^2)}^{-1}(p) = FInv(p, N(\mu,\sigma^2))$.

If $Z \sim N(0,1)$, we say that Z has the standard normal distribution. The pdf,
cdf, (lower) p-quantile and upper p-quantile of Z may be denoted by $\phi(z)$,
$\Phi(z)$, $\Phi^{-1}(p)$, and $z_p = \Phi^{-1}(1-p)$, respectively.

This notation means that if $X \sim N(\mu, \sigma^2)$, then we may write:

$$f(x) = \frac{1}{\sigma}\phi\left(\frac{x-\mu}{\sigma}\right), \qquad F(x) = \Phi\left(\frac{x-\mu}{\sigma}\right), \qquad F^{-1}(p) = \mu + \sigma\Phi^{-1}(p).$$

Note: We sometimes use upper and lower case letters interchangeably.
Thus $X \sim N(\mu, \sigma^2)$ may also be written $x \sim N(\mu, \sigma^2)$. The pdf of a rv
X when evaluated at c may also be denoted by f(x=c).
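
As a quick numerical illustration of the standardisation identities for the normal pdf, cdf and quantile function, the following R sketch compares both sides of each identity. The values of `mu`, `sigma`, `p` and `x` are arbitrary choices for illustration. Note that R's `dnorm`, `pnorm` and `qnorm` are parameterised by the standard deviation, not the variance.

```r
mu <- 10; sigma <- 2; p <- 0.05; x <- 12.5

# pdf: f(x) = (1/sigma) * phi((x - mu)/sigma)
f_direct <- dnorm(x, mean = mu, sd = sigma)
f_std    <- dnorm((x - mu) / sigma) / sigma

# cdf: F(x) = Phi((x - mu)/sigma)
F_direct <- pnorm(x, mean = mu, sd = sigma)
F_std    <- pnorm((x - mu) / sigma)

# lower p-quantile: FInv(p) = mu + sigma * PhiInv(p)
q_direct <- qnorm(p, mean = mu, sd = sigma)
q_std    <- mu + sigma * qnorm(p)

# upper p-quantile of Z, i.e. z_p = PhiInv(1 - p)
z_p <- qnorm(1 - p)

rbind(pdf = c(f_direct, f_std), cdf = c(F_direct, F_std),
      quantile = c(q_direct, q_std))
```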


B.2 The gamma distribution 


A random variable X has the gamma distribution with parameters a and b
if its pdf has the form

$$f_X(x) = \frac{b^a x^{a-1} e^{-bx}}{\Gamma(a)}, \quad x > 0.$$

We then write X ~ Gamma(a,b) or X ~ Gam(a,b) or X ~ G(a,b). We may
also write $f_X(x)$ as $f(x)$ or $f_{G(a,b)}(x)$ or $f(x, G(a,b))$.

The cdf of X may be written $F_X(x) = F_{G(a,b)}(x) = F(x, G(a,b))$, and X's
p-quantile is $F_X^{-1}(p) = F_{G(a,b)}^{-1}(p) = F^{-1}(p, G(a,b)) = FInv(p, G(a,b))$.
If X ~ G(a,b) then:

Mode(X) = (a-1)/b if a > 1
Mode(X) = 0 if a ≤ 1
$EX = a/b$, $VX = a/b^2$
$EX^k = \dfrac{\Gamma(a+k)}{b^k\,\Gamma(a)}$ (the kth raw moment of X).

The last result may be proved by writing

$$EX^k = \int_0^\infty x^k \frac{b^a x^{a-1} e^{-bx}}{\Gamma(a)}\,dx = \frac{\Gamma(a+k)}{b^k\,\Gamma(a)} \int_0^\infty \frac{b^{a+k} x^{a+k-1} e^{-bx}}{\Gamma(a+k)}\,dx$$

and noting that the last integral is equal to unity. 
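
The kth raw moment formula $EX^k = \Gamma(a+k)/(b^k\,\Gamma(a))$ can be checked numerically. The sketch below uses arbitrary illustrative values of `a`, `b` and `k`. It relies on the fact that the rate parameterisation of R's `dgamma` (argument `rate`) matches the definition of G(a,b) used here, under which EX = a/b.

```r
a <- 3; b <- 2; k <- 2

# Closed form: Gamma(a + k) / (b^k * Gamma(a))
closed_form <- gamma(a + k) / (b^k * gamma(a))

# Numerical integration of x^k against the G(a, b) density
by_quadrature <- integrate(function(x) x^k * dgamma(x, shape = a, rate = b),
                           lower = 0, upper = Inf)$value

c(closed_form = closed_form, by_quadrature = by_quadrature)
```

With these values the closed form equals 3, which agrees with $VX + (EX)^2 = 3/4 + 9/4$.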


The definition of the gamma distribution involves the gamma function,

$$\Gamma(k) = \int_0^\infty t^{k-1} e^{-t}\,dt.$$

Some properties of the gamma function are as follows:
$\Gamma(k) \to \infty$ as $k \to \infty$ or $k \to 0$
$\Gamma(k) = (k-1)\Gamma(k-1)$ for $k > 1$
$\Gamma(k) = (k-1)!$ if $k \in \{1,2,3,\ldots\}$ (with 0! = 1)
$\Gamma(1/2) = \sqrt{\pi}$.

Note: There is an alternative definition of the gamma distribution,
whereby X ~ G(a,b) means $f(x) = b^{-a} x^{a-1} e^{-x/b} / \Gamma(a)$, $x > 0$, so that
EX = ab. This alternative definition is not used in this book.


B.3 The exponential distribution 


If X ~ G(1,b) then X has the exponential distribution with parameter b,
and we write X ~ Exponential(b) or X ~ Expo(b).

Note: We do not write X ~ Exp(b) because this could more easily be
confused with $X = \exp(b) = e^b$ (where exp is the exponential function).

The pdf of X, namely $f(x) = be^{-bx}$, $x > 0$, may also be written as
$f_{Expo(b)}(x)$ or $f(x, Expo(b))$.

If X ~ Expo(1), we say that X has the standard exponential distribution.


B.4 The chi-squared distribution


If X ~ G(m/2, 1/2) then X has the chi-squared distribution with
parameter m (called the degrees of freedom, abbreviated dof).

We then write $X \sim \chi^2(m)$ or X ~ Chisq(m), and denote the pdf of X by
$f_{\chi^2(m)}(x)$ or $f(x, Chisq(m))$.

The upper p-quantile of the $\chi^2(m)$ distribution may be written

$$\chi^2_p(m) = F^{-1}_{\chi^2(m)}(1-p) = FInv(1-p, Chisq(m)).$$

A useful result is that if Y = rX, where X ~ Gamma(m/2, r/2), then
$Y \sim G(m/2, 1/2) \sim \chi^2(m)$. This result can be proved easily using the
transformation rule, as follows:

$$f(y) = f(x)\left|\frac{dx}{dy}\right| \propto \left(\frac{y}{r}\right)^{\frac{m}{2}-1} e^{-\frac{r}{2}\left(\frac{y}{r}\right)} \times \frac{1}{r} \propto y^{\frac{m}{2}-1} e^{-\frac{y}{2}}.$$

Note: The symbol $\propto$ here denotes 'proportionality with respect to y'.
The statement $g \propto_t h$ means $g = c \times h$, where c is a constant that does
not depend on t. E.g. if $g = 5t^2r^3$, we may write: $g \propto_t t^2$, $g \propto_r r^3$,
$g \propto_{t,r} t^2r^3$, etc. By default, $g(t) \propto\, ?$ means $g(t) \propto_t ?$,
and $g(t \mid u) \propto\, ?$ means $g(t \mid u) \propto_t ?$ (not $g(t \mid u) \propto_{t,u} ?$).
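
The scaling result (if Y = rX with X ~ Gamma(m/2, r/2), then Y has the chi-squared distribution with m degrees of freedom) can also be confirmed numerically by comparing cdfs. In the sketch below, `m`, `r` and the evaluation points `y` are arbitrary illustrative choices.

```r
m <- 5; r <- 3
y <- c(0.5, 2, 7.3)

# P(Y <= y) = P(X <= y/r), where X ~ Gamma(m/2, r/2) in the rate parameterisation
cdf_via_gamma <- pgamma(y / r, shape = m / 2, rate = r / 2)

# cdf of the chi-squared distribution with m degrees of freedom
cdf_chisq <- pchisq(y, df = m)

rbind(cdf_via_gamma, cdf_chisq)
```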


B.5 The inverse gamma distribution 


If X ~ G(a,b), then Y = 1/X has the inverse gamma distribution with
parameters a and b. In that case, we write Y ~ InverseGamma(a,b) or
Y ~ IGam(a,b) or Y ~ IG(a,b).

By the transformation rule, the pdf of Y is

$$f(y) = f(x)\left|\frac{dx}{dy}\right| = \frac{b^a (1/y)^{a-1} e^{-b(1/y)}}{\Gamma(a)} \times \frac{1}{y^2} = \frac{b^a y^{-(a+1)} e^{-b/y}}{\Gamma(a)}, \quad y > 0,$$

which may also be written $f_{IG(a,b)}(y)$ or $f(y, IG(a,b))$.

Some other properties of Y are as follows:
EY = b/(a-1) if a > 1, EY = ∞ if a ≤ 1
$VY = b^2/((a-1)^2(a-2))$ if a > 2, Mode(Y) = b/(a+1).


B.6 The t distribution


A random variable X has the t distribution with parameter m if

$$f(x) = \frac{\Gamma((m+1)/2)}{\sqrt{m\pi}\,\Gamma(m/2)} \left(1 + \frac{x^2}{m}\right)^{-\frac{1}{2}(m+1)}, \quad -\infty < x < \infty.$$

In that case, we write X ~ t(m) and denote the density of X by $f_{t(m)}(x)$
or $f(x, t(m))$. The cdf of X is denoted $F_{t(m)}(x)$ or $F(x, t(m))$, and the
upper p-quantile may be written $t_p(m) = F^{-1}_{t(m)}(1-p) = FInv(1-p, t(m))$.

We call m the degrees of freedom parameter.

An equivalent definition of the t distribution is as follows. If Z ~ N(0,1),
$Y \sim \chi^2(m)$ and $Z \perp Y$, then $X = Z/\sqrt{Y/m} \sim t(m)$.

Note: The symbol $\perp$ here denotes independence. Thus, the statement
$A \perp B$ means that A and B are independent random variables. Likewise,
$(A \perp B \mid C)$ means that A and B are independent conditional on C.


B.7 The F distribution 


Suppose that $U \sim \chi^2(a)$, $W \sim \chi^2(b)$ and $U \perp W$. Then $X = \dfrac{U/a}{W/b}$ has
the F distribution with parameters a and b. We then write X ~ F(a,b).
The pdf and cdf of X (both omitted here) may be denoted $f_{F(a,b)}(x)$ and
$F_{F(a,b)}(x)$, respectively. We call a the numerator degrees of freedom and
b the denominator degrees of freedom. The upper p-quantile of X may be
denoted as $F_p(a,b)$ or $F^{-1}_{F(a,b)}(1-p)$ or $FInv(1-p, F(a,b))$.


B.8 The (continuous) uniform distribution 


A random variable X has the (continuous) uniform distribution with 
parameters a and b if its pdf is $f(x) = 1/(b-a)$, $a < x < b$.

We then write X ~ U(a,b) and $f(x) = f_{U(a,b)}(x) = f(x, U(a,b))$.
The cdf of X is $F_{U(a,b)}(x) = F(x, U(a,b)) = (x-a)/(b-a)$, $a < x < b$.

The mean and variance of X are $(a+b)/2$ and $(b-a)^2/12$.


B.9 The discrete uniform distribution


A random variable X has the discrete uniform distribution with parameters
$a_1, \ldots, a_K$ if its density is $f(x) = 1/K$, $x = a_1, \ldots, a_K$.

We then write $X \sim DU(a_1, \ldots, a_K)$. The density f(x) may also be written
$f_{DU(a_1,\ldots,a_K)}(x)$ or $f(x, DU(a_1,\ldots,a_K))$.

Equivalently, we may describe X as having the discrete uniform
distribution with parameter $a = (a_1, \ldots, a_K)$ (a vector). In that case, we may
write X ~ DU(a) and denote f(x) by $f_{DU(a)}(x)$ or $f(x, DU(a))$.

Note: Because X here is discrete, f(x) may more aptly be called the
probability mass function (pmf) of X. But for simplicity, we usually use
the term probability density function (pdf) or density in reference to any
type of random variable (continuous, discrete or mixed).


B.10 The binomial distribution 


A rv X has the binomial distribution with parameters n and p if its density 
has the form 


$$f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n.$$

We then write X ~ Bin(n, p). The density f(x) may also be denoted by
$f_{Bin(n,p)}(x)$ or $f(x, Bin(n,p))$. The mean and variance of X are np and
np(1-p). We call n the number of trials and p the probability of success
(equivalently, the binomial parameter or the binomial proportion).


B.11 The Bernoulli distribution


If X ~ Bin(1, p) then we say that X has the Bernoulli distribution with
parameter p. We then write X ~ Bernoulli(p) or X ~ Bern(p).


B.12 The geometric distribution 


A random variable X is said to have the geometric distribution with
parameter p if its pdf has the form

$$f(x) = (1-p)^{x-1} p, \quad x = 1, 2, 3, \ldots$$

We then write X ~ Geo(p). The pdf of X may be denoted by $f_{Geo(p)}(x)$
or $f(x, Geo(p))$. The mean and variance of X are $1/p$ and $(1-p)/p^2$.
The cdf of X is given by
$$F_{Geo(p)}(x) = F(x, Geo(p)) = P(X \le x) = 1 - (1-p)^x, \quad x = 1, 2, 3, \ldots$$
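
A practical caution when evaluating these quantities in R: the built-in functions `dgeom` and `pgeom` describe the number of failures before the first success (support 0, 1, 2, ...), whereas the definition above counts the trial on which the first success occurs (support 1, 2, 3, ...). The sketch below, with an arbitrary illustrative value of `p`, shows the shift by one that reconciles the two conventions.

```r
p <- 0.3
x <- 1:6

# Density and cdf under the definition used here (support 1, 2, 3, ...)
pdf_book <- (1 - p)^(x - 1) * p
cdf_book <- 1 - (1 - p)^x

# The same quantities via R's failure-count parameterisation, shifted by one
pdf_R <- dgeom(x - 1, prob = p)
cdf_R <- pgeom(x - 1, prob = p)

rbind(pdf_book, pdf_R, cdf_book, cdf_R)
```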




Abbreviations and Acronyms 


Below are some of the abbreviations and acronyms used in this book. The 
list may not be comprehensive. Some of the expressions listed have more 
than one meaning, depending on the context. 


ACF          autocorrelation function
AELF         absolute error loss function
AR           autoregressive (process); acceptance rate
ARMA         autoregressive moving average (process)

B            beta function; bias
Bern         Bernoulli distribution
Beta         beta distribution
BF           Bayes factor
Bin, Binom   binomial distribution
BUGS         Bayesian inference Using Gibbs Sampling (software
             environment for performing MCMC)

C, Cov       covariance operator
cdf          cumulative distribution function (same as df)
CDR          central density region
Chisq        chi-squared distribution (equivalent to χ²)
CI           confidence interval
CNR          conditional Newton-Raphson (algorithm)
CPDI         central posterior (or predictive) density interval
central posterior (or predictive) density region 
continuous 


data 

data augmentation (algorithm) 
distribution function (same as cdf) 
degrees of freedom 

distribution 

discrete uniform distribution


expectation operator
Euler's number (2.71828)
Expectation-Conditional-Maximisation (algorithm)
error loss function
Expectation-Maximisation (algorithm)
Expectation Step (in EM algorithm)
exponential function (e raised to a power)
exponential distribution

F distribution; (cumulative) distribution function
pdf or pmf (same as p); finite population correction factor
frequentist coverage probability
inverse distribution function (equivalent to F^(-1))
finite population correction (factor)

gamma distribution (not to be confused with the gamma
function, which is denoted by the Greek letter Γ)
geometric distribution
generalised linear model
Gibbs sampler/sampling

highest posterior (or predictive) density interval
highest posterior (or predictive) density region
hypergeometric distribution

standard indicator function; vector of sample inclusion
indicators (or counters); Fisher information

id           identically distributed (not necessarily independent)
IELF         indicator error loss function
IG, IGam     inverse gamma distribution
iid          independent and identically distributed (as)
ind, indep   independent (not necessarily identically distributed)


Monte Carlo sample size 


loss function; lower bound; ordered sample (vector of the 
labels of selected units in the order that they are sampled) 
law of iterated covariance:
C(X,Y) = EC(X,Y | Z) + C{E(X | Z), E(Y | Z)}
law of iterated expectation: EX = EE(X | Z)
law of iterated variance: VX = EV(X | Z) + VE(X | Z)
natural logarithm (to base e) 


nonsample size (m = N - n)
moving average (process); Metropolis algorithm
mean absolute deviation; finite population mean
absolute deviation about the superpopulation mean
maximum/maximise
Monte Carlo (method); Markov chain
Markov chain Monte Carlo (method)
Metropolis-Hastings (algorithm)
minimum/minimise
maximum likelihood (method)
maximum likelihood estimate/estimator/estimation
method of moments estimate/estimator/estimation
Maximisation Step (in EM algorithm)

normal (or Gaussian) distribution; finite population size
sample size
normal-gamma (Bayesian model)
normal-normal (Bayesian model)
normal-normal-gamma (Bayesian model)
Newton-Raphson (algorithm)

P, Pr, Prob  probability function
p            binomial proportion; pdf or pmf (same as f)
PACF         partial autocorrelation function
PDF          portable document format (file)
pdf          probability density function (used for all types of rvs:
             continuous, discrete and mixed); used instead of pmf
PEL          posterior expected loss (function)
pmf          probability mass function (rarely used; see pdf)
Poi          Poisson distribution
POO          posterior odds
pop          population
post         posterior
ppp-value    posterior predictive p-value
pr, prob     probability
pred         predictive/prediction/predictor
PRO          prior odds
pt           point

quantity of interest; quantile function; Q-function (in the
EM algorithm)
quadratic error loss function

R            R (software environment for statistical computing)
R            relative bias; risk function (not to be confused with ℝ,
             which denotes the whole real line)
r            Bayes risk; nonsample (vector of the labels of the units
             that are not sampled)
RB           Rao-Blackwell (estimate/estimator/estimation or method)
rv           random variable

s            sample standard deviation; sample (vector of the labels of
             the units that are sampled)
SD, sd       standard deviation
SE, se       standard error (estimate of standard deviation)
SMA          seasonal moving average (process)
SRS          simple random sampling (with or without replacement)
SRSWOR       simple random sampling without replacement
SRSWR        simple random sampling with replacement
st           such that

T            random variable with the t distribution
t            t distribution; upper quantile of the t distribution
TIAP         Total International Airline Passengers (time series)

U            (continuous) uniform distribution; random variable with
             the standard uniform distribution; upper bound

V, Var       variance operator

WinBUGS      BUGS for Microsoft Windows (see BUGS)
wrt          with respect to

X            finite population covariate vector (of N values)
x            sample covariate vector (of n values)

Y            random variable or vector of random variables;
             finite population vector (of N values)
y            realised value of a random variable or vector of
             random variables; sample vector (of n values);
             sometimes used interchangeably with Y

Z            standard normal random variable
z            upper quantile of the standard normal distribution